Foreword


Machine learning is considered by many to be the future of statistics and computer
engineering as it reshapes customer service, design, banking, medicine, manufacturing, and
hosts of other disciplines and industries. It is hard to overstate its impact on the world so
far and the changes it will bring about in the coming years and decades. Of the multitude
of machine learning methods applied by professionals, such as penalized regression,
random forests, and boosted trees, perhaps the most excitement-inducing is deep learning.

Deep learning has revolutionized computer vision and natural language processing, 
and researchers are still finding new areas to transform with the power of neural networks.
Its most profound impact is often seen in efforts to replicate the human experience, 
such as the aforementioned vision and language processing, and also audio synthesis 
and translations. The math and concepts underlying deep learning can seem daunting, 
unnecessarily deterring people from getting started. 

The authors of Deep Learning Illustrated challenge the traditionally perceived barriers 
and impart their knowledge with ease and levity, resulting in a book that is enjoyable to 
read. Much like the other books in this series—R for Everyone, Pandas for Everyone,
Programming Skills for Data Science, and Machine Learning with Python for Everyone—this book
is welcoming and accessible to a broad audience from myriad backgrounds. Mathematical
notation is kept to a minimum and, when needed, the equations are presented alongside
understandable prose. The majority of insights are augmented with visuals, illustrations,
and Keras code, which is also available as easy-to-follow Jupyter notebooks. 

Jon Krohn has spent many years teaching deep learning, including a particularly mem- 
orable presentation at the New York Open Statistical Programming Meetup—the same 
community from which he launched his Deep Learning Study Group. His mastery of the 
subject shines through in his writing, giving readers ample education while at the same 
time inviting them to be excited about the material. He is joined by Grant Beyleveld and 
Aglae Bassens who add their expertise in applying deep learning algorithms and skillful 
drawings. 

Deep Learning Illustrated combines theory, math where needed, code, and visualizations 
for a comprehensive treatment of deep learning. It covers the full breadth of the subject, 
including densely connected networks, convolutional neural nets, recurrent neural nets, 
generative adversarial networks, and reinforcement learning, and their applications. This 
makes the book the ideal choice for someone who wants to learn about neural networks 
with practical guidance for implementing them. Anyone can, and should, benefit from, as 
well as enjoy, their time spent reading along with Jon, Grant, and Aglae. 

—Jared Lander
Series Editor 
Preface


Commonly called brain cells, billions of interconnected neurons make up your nervous 
system, and they enable you to sense, to think, and to take action. By meticulously 
staining and examining thin slices of brain tissue, the Spanish physician Santiago Cajal
(Figure P.1) was the first 1 to identify neurons (Figure P.2), and in the early half of the
twentieth century, researchers began to shed light on how these biological cells work. By
the 1950s, scientists inspired by our developing understanding of the brain were
experimenting with computer-based artificial neurons, linking these together to form artificial
neural networks that loosely mimic the operation of their natural namesake.

Armed with this brief history of neurons, we can define the term deep learning
deceptively straightforwardly: Deep learning involves a network in which artificial neurons—
typically thousands, millions, or many more of them—are stacked at least several layers
deep. The artificial neurons in the first layer pass information to the second, the second
to the third, and so on, until the final layer outputs some values. That said, as we literally
illustrate throughout this book, this simple definition does not satisfactorily capture deep
learning’s remarkable breadth of functionality nor its extraordinary nuance.
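
To make the layer-to-layer flow described above concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than code from this book: the helper names (make_layer, layer_output, forward), the layer sizes, the sigmoid squashing function, and the random, untrained weights are all hypothetical choices made for the example.

```python
import math
import random


def make_layer(n_inputs, n_neurons, rng):
    """Create one layer: a weight list and a bias for each artificial neuron."""
    return [
        ([rng.uniform(-1.0, 1.0) for _ in range(n_inputs)], rng.uniform(-1.0, 1.0))
        for _ in range(n_neurons)
    ]


def layer_output(layer, inputs):
    """Each neuron takes a weighted sum of its inputs and squashes it into (0, 1)."""
    outputs = []
    for weights, bias in layer:
        z = sum(w * x for w, x in zip(weights, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid (an assumed choice)
    return outputs


def forward(network, inputs):
    """Pass information from the first layer to the second, and so on."""
    values = inputs
    for layer in network:
        values = layer_output(layer, values)
    return values


rng = random.Random(42)  # fixed seed so the sketch is repeatable
layer_sizes = [4, 8, 8, 8, 2]  # 4 inputs, three hidden layers, 2 outputs (hypothetical)
network = [
    make_layer(n_in, n_out, rng)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]

final_values = forward(network, [0.5, -0.2, 0.1, 0.9])
print(final_values)  # two numbers between 0 and 1, meaningless until trained
```

Because the weights here are random, the output values carry no meaning; the point is only the structure, in which each layer consumes the previous layer's outputs until the final layer produces some values.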

As we detail in Chapter 1, with the advent of sufficiently inexpensive computing 
power, sufficiently large datasets, and a handful of landmark theoretical advances, the first 
wave of the deep learning tsunami to hit the proverbial shore was a standout performance 
in a leading machine vision competition in 2012. Academics and technologists took note, 
and in the action-packed years since, deep learning has facilitated countless now-everyday 



1. Cajal, S.-R. (1894). Les Nouvelles Idées sur la Structure du Système Nerveux chez l’Homme et chez les Vertébrés. Paris:
C. Reinwald & Companie. 
Figure P.2 A hand-drawn diagram from Cajal’s (1894) publication showing the growth
of a neuron (a-e) and contrasting neurons from frog (A), lizard (B), rat (C), and human
(D) samples


applications. From Tesla’s Autopilot to the voice recognition of Amazon’s Alexa, from
real-time translation between languages to its integration in hundreds of Google products,
deep learning has improved the accuracy of a great number of computational tasks from
95 percent to 99 percent or better—the tricky few percent that can make an automated
service feel as though it works by magic. Although the concrete, interactive code examples
throughout this book will dispel this apparent wizardry, deep learning has indeed
imbued machines with superhuman capability on complex tasks as diverse as face recognition,
text summarization, and elaborate board games. 2 Given these prominent advances, it
is unsurprising that “deep learning” has become synonymous with “artificial intelligence”
in the popular press, the workplace, and the home.

These are exciting times, because, as you’ll discover over the course of this book, per- 
haps only once in a lifetime does a single concept disrupt so widely in such a short period 
of time. We are delighted that you too have developed an interest in deep learning and 
we can’t wait to share our enthusiasm for this unprecedentedly transformative technique 
with you. 

How to Read This Book 


This book is split into four parts. Part I, “Introducing Deep Learning,” is well suited to 
any interested reader. This part serves as a high-level overview that establishes what deep 
learning is, how it evolved to be ubiquitous, and how it is related to concepts like AI, 
machine learning, and reinforcement learning. Replete with vivid bespoke illustrations, 
straightforward analogies, and character-focused narratives, Part I should be illuminating 
for anyone, including individuals with no software programming experience.


2. See bit.ly/aiindex18 for a review of machine performance relative to humans.
In contrast, Parts II through IV are intended for software developers, data scientists,
researchers, analysts, and others who would like to learn how to apply deep learning
techniques in their field. In these parts of the book, essential underlying theory is covered
in a manner that minimizes mathematical formulas, relying instead on intuitive visuals and
hands-on examples in Python. Alongside this theory, working code run-throughs available
in accompanying Jupyter notebooks 3 facilitate a pragmatic understanding of the principal
families of deep learning approaches and applications: machine vision (Chapter 10),
natural language processing (Chapter 11), image generation (Chapter 12), and game
playing (Chapter 13). For clarity, wherever we refer to code, we will provide it in
fixed-width font, like this. For further readability, in code chunks we also include the
default Jupyter styling (e.g., numbers in green, strings in red, etc.).

If you find yourself yearning for more detailed explanations of the mathematical
and statistical foundations of deep learning than we offer in this book, our two favorite 
options for further study are: 

1. Michael Nielsen’s e-book Neural Networks and Deep Learning, 4 which is short, makes
use of fun interactive applets to demonstrate concepts, and uses mathematical notation
similar to ours

2. Ian Goodfellow (introduced in Chapter 3), Yoshua Bengio (Figure 1.10), and Aaron 
Courville’s book Deep Learning, 5 which comprehensively covers the math that
underlies neural network techniques 

Scattered throughout this book, you will find amiable trilobites that would like to 
provide you with tidbits of unessential reading that they think you may find interesting or
helpful. The reading trilobite (as in Figure P.3) is a bookworm who enjoys expanding your
knowledge. The trilobite calling for your attention, meanwhile (as in Figure P.4), has noticed
a passage of text that may be problematic, and so would like to clarify the situation. In
addition to trilobites habituated within sidebars, we made liberal use of footnotes. These



Figure P.3 The reading trilobite enjoys expanding your knowledge. 


3. github.com/the-deep-learners/deep-learning-illustrated

4. Nielsen, M. (2015). Neural Networks and Deep Learning. Determination Press. Available for free at: 
neuralnetworksanddeeplearning.com 

5. Goodfellow, I., et al. (2016). Deep Learning. MIT Press. Available for free at: deeplearningbook.org 
Figure P.4 This trilobite calls attention to tricky passages of text. Look out for it!


are likewise not essential reading but provide quick explanations of new terms and
abbreviations, as well as citations of seminal papers and other references for you to follow up
with if you’re so inclined.

For much of this book’s content, corresponding video tutorials are also available.
Although the book provided us with an opportunity to flesh out theoretical concepts more
thoroughly, the videos enable you to become familiar with our Jupyter notebooks from
a different perspective, in which the importance of each line of code is described verbally
as it is typed out. 6 The video tutorial series is spread across three titles, each of which
parallels particular chapters of the book:

1. Deep Learning with TensorFlow LiveLessons: 7 Chapter 1 and Chapters 5 through 10 

2. Deep Learning for Natural Language Processing LiveLessons: 8 Chapters 2 and 11 

3. Deep Reinforcement Learning and GANs LiveLessons: 9 Chapters 3, 4, 12, and 13 


Register your copy of Deep Learning Illustrated on the InformIT site for convenient
access to updates and corrections as they become available. To start the registration
process, go to informit.com/register and log in or create an account. Enter the
product ISBN (9780135116692) and click Submit. Look on the Registered Products
tab for an Access Bonus Content link next to this product, and follow that link to
access any available bonus materials. If you would like to be notified of exclusive
offers on new editions and updates, please check the box to receive email from us.


6. Many of the Jupyter notebooks covered in this book are derived directly from the videos, which were all recorded 
prior to writing. In some places, we decided to update the code for the book, so while the video version and the 
book version of a given code notebook align quite closely, they may not always be strictly identical.

7. Krohn, J. (2017). Deep Learning with TensorFlow LiveLessons: Applications of Deep Neural Networks to Machine 
Learning Tasks (video course). Boston: Addison-Wesley. 

8. Krohn, J. (2017). Deep Learning for Natural Language Processing LiveLessons: Applications of Deep Neural Networks 
to Machine Learning Tasks (video course). Boston: Addison-Wesley. 

9. Krohn, J. (2018). Deep Reinforcement Learning and GANs LiveLessons: Advanced Topics in Deep Learning (video 
course). Boston: Addison-Wesley. 
Biological and Machine Vision


Throughout this chapter and much of this book, the visual system of biological
organisms is used as an analogy to bring deep learning to, um . . . life. In addition to conveying
a high-level understanding of what deep learning is, this analogy provides insight into 
how deep learning approaches are so powerful and so broadly applicable. 

Biological Vision 

Five hundred fifty million years ago, in the prehistoric Cambrian period, the number 
of species on the planet began to surge (Figure 1.1). From the fossil record, there is evi- 
dence 1 that this explosion was driven by the development of light detectors in the trilo- 
bite, a small marine animal related to modern crabs (Figure 1.2). A visual system, even 
a primitive one, bestows a delightful bounty of fresh capabilities. One can, for example, 
spot food, foes, and friendly-looking mates at some distance. Other senses, such as smell, 
enable animals to detect these as well, but not with the accuracy and light-speed pace of 
vision. Once the trilobite could see, the hypothesis goes, this set off an arms race that 
produced the Cambrian explosion: The trilobite’s prey, as well as its predators, had to 
evolve to survive. 

In the half-billion years since trilobites developed vision, the complexity of the sense 
has increased considerably. Indeed, in modern mammals, a large proportion of the cerebral 
cortex —the outer gray matter of the brain—is involved in visual perception. 2 At Johns 


1. Parker, A. (2004). In the Blink of an Eye: How Vision Sparked the Big Bang of Evolution. New York: Basic Books. 

2. A couple of tangential facts about the cerebral cortex: First, it is one of the more recent evolutionary develop- 
ments of the brain, contributing to the complexity of mammal behavior relative to the behavior of older classes 
of animals like reptiles and amphibians. Second, while the brain is informally referred to as gray matter because the 
cerebral cortex is the brain’s external surface and this cortical tissue is gray in color, the bulk of the brain is in fact
white matter. By and large, the white matter is responsible for carrying information over longer distances than the 
gray matter, so its neurons have a white-colored, fatty coating that hurries the pace of signal conduction. A coarse 
analogy could be to consider neurons in the white matter to act as “highways.” These high-speed motorways have 
scant on-ramps or exits, but can transport a signal from one part of the brain to another lickety-split. In contrast, 
the “local roads” of gray matter facilitate myriad opportunities for interconnection between neurons at the expense 
of speed. A gross generalization, therefore, is to consider the cerebral cortex—the gray matter—as the part of the 
brain where the most complex computations happen, affording the animals with the largest proportion of it—such 
as mammals, particularly the great apes like Homo sapiens —their complex behaviors. 
Figure 1.1 The number of species on our planet began to increase rapidly
550 million years ago, during the prehistoric Cambrian period. “Genera” are categories
of related species.



Figure 1.2 A bespectacled trilobite 


Hopkins University in the late 1950s, the physiologists David Hubel and Torsten Wiesel 
(Figure 1.3) began carrying out their pioneering research on how visual information is 
processed in the mammalian cerebral cortex, 3 work that contributed to their later being 
awarded a Nobel Prize. 4 As depicted in Figure 1.4, Hubel and Wiesel conducted their
research by showing images to anesthetized cats while simultaneously recording the activity
of individual neurons from the primary visual cortex, the first part of the cerebral cortex to 
receive visual input from the eyes. 

Projecting slides onto a screen, Hubel and Wiesel began by presenting simple shapes 
like the dot shown in Figure 1.4 to the cats. Their initial results were disheartening: Their 
efforts were met with no response from the neurons of the primary visual cortex. They 
grappled with the frustration of how these cells, which anatomically appear to be the 
gateway for visual information to the rest of the cerebral cortex, would not respond to 


3. Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat’s striate cortex. The Journal
of Physiology, 148, 574-91.

4. The 1981 Nobel Prize in Physiology or Medicine, shared with American neurobiologist Roger Sperry. 
Figure 1.3 The Nobel Prize-winning neurophysiologists Torsten Wiesel (left) and
David Hubel



Figure 1.4 Hubel and Wiesel used a light projector to present slides to anesthetized 
cats while they recorded the activity of neurons in the cats’ primary visual cortex. In the 
experiments, electrical recording equipment was implanted within the cat's skull. Instead 
of illustrating this, we suspected it would be a fair bit more palatable to use a lightbulb to 
represent neuron activation. Depicted in this figure is a primary visual cortex neuron 
being serendipitously activated by the straight edge of a slide. 


visual stimuli. Distraught, Hubel and Wiesel tried in vain to stimulate the neurons by 
jumping and waving their arms in front of the cat. Nothing. And then, as with many of 
the great discoveries, from X-rays to penicillin to the microwave oven, Hubel and Wiesel 
made a serendipitous observation: As they removed one of their slides from the projector, 
its straight edge elicited the distinctive crackle of their recording equipment to alert them 
that a primary visual cortex neuron was firing. Overjoyed, they celebrated up and down 
the Johns Hopkins laboratory corridors.
The serendipitously crackling neuron was not an anomaly. Through further
experimentation, Hubel and Wiesel discovered that the neurons that receive visual input from
the eye are in general most responsive to simple, straight edges. Fittingly then, they 
named these cells simple neurons. 

As shown in Figure 1.5, Hubel and Wiesel determined that a given simple neuron 
responds optimally to an edge at a particular, specific orientation. A large group of simple 
neurons, with each specialized to detect a particular edge orientation, together is able to 
represent all 360 degrees of orientation. These edge-orientation detecting simple cells
then pass along information to a large number of so-called complex neurons. A given
complex neuron receives visual information that has already been processed by several simple
cells, so it is well positioned to recombine multiple line orientations into a more complex 
shape like a corner or a curve. 
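
As a rough illustration of the simple-cell and complex-cell idea just described, the following toy sketch scores how well a line in a small pixel patch overlaps with a cell's preferred orientation. Everything in it is an assumption made for the example: the 9-by-9 patch, the function names, the overlap-count "response," and the min-based complex cell are hypothetical simplifications, not a model of real neurons or of Hubel and Wiesel's recordings.

```python
import math

SIZE = 9  # width and height of the square image patch, in pixels


def draw_line(angle_degrees):
    """Return a SIZE x SIZE patch of 0s with a line of 1s through its center."""
    patch = [[0] * SIZE for _ in range(SIZE)]
    center = SIZE // 2
    theta = math.radians(angle_degrees)
    for step in range(-center, center + 1):
        col = center + round(step * math.cos(theta))
        row = center - round(step * math.sin(theta))
        patch[row][col] = 1
    return patch


def simple_cell_response(preferred_angle, patch):
    """Overlap between the patch and the cell's preferred line orientation."""
    template = draw_line(preferred_angle)
    return sum(
        template[r][c] * patch[r][c] for r in range(SIZE) for c in range(SIZE)
    )


def complex_cell_response(patch):
    """Toy complex cell: active only when vertical AND horizontal cells both fire."""
    return min(simple_cell_response(90, patch), simple_cell_response(0, patch))


# One vertically tuned simple cell, shown lines at several orientations.
tuning = {
    angle: simple_cell_response(90, draw_line(angle)) for angle in (0, 45, 90, 135)
}
print(tuning)

# A plus-shaped stimulus contains both a vertical and a horizontal line.
vertical_line = draw_line(90)
horizontal_line = draw_line(0)
cross = [
    [max(v, h) for v, h in zip(v_row, h_row)]
    for v_row, h_row in zip(vertical_line, horizontal_line)
]
print(complex_cell_response(vertical_line), complex_cell_response(cross))
```

In this sketch the vertically tuned cell responds most strongly to the vertical line, and the toy complex cell responds strongly only to the plus shape, which combines two orientations that no single simple cell captures on its own.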

Figure 1.6 illustrates how, via many hierarchically organized layers of neurons feed- 
ing information into increasingly higher-order neurons, gradually more complex visual 
stimuli can be represented by the brain. The eyes are focused on an image of a mouse’s







Figure 1.5 A simple cell in the primary visual cortex of a cat fires at different rates, 
depending on the orientation of a line shown to the cat. The orientation of the line is 
provided in the left-hand column, while the right-hand column shows the firing (electrical 
activity) in the cell over time (one second). A vertical line (in the fifth row from the top) 
causes the most electrical activity for this particular simple cell. Lines slightly off vertical 
(in the intermediate rows) cause less activity for the cell, while lines approaching 
horizontal (in the topmost and bottommost rows) cause little to no activity. 
Figure 1.6 A caricature of how consecutive layers of biological neurons represent
visual information in the brain of, for example, a cat or a human 


head. Photons of light stimulate neurons located in the retina of each eye, and this raw 
visual information is transmitted from the eyes to the primary visual cortex of the brain. 
The first layer of primary visual cortex neurons to receive this input—Hubel and Wiesel’s
simple cells—are specialized to detect edges (straight lines) at specific orientations. There
would be many thousands of such neurons; for simplicity, we’re only showing four in 
Figure 1.6. These simple neurons relay information about the presence or absence of 
lines at particular orientations to a subsequent layer of complex cells, which assimilate and 
recombine the information, enabling the representation of more complex visual stimuli 
such as the curvature of the mouse’s head. As information is passed through several
subsequent layers, representations of visual stimuli can incrementally become more complex
and more abstract. As depicted by the far-right layer of neurons, following many layers of
such hierarchical processing (we use the arrow with dashed lines to imply that many more
layers of processing are not being shown), the brain is ultimately able to represent visual
concepts as abstract as a mouse, a cat, a bird, or a dog. 

Today, through countless subsequent recordings from the cortical neurons of brain- 
surgery patients as well as noninvasive techniques like magnetic resonance imaging 
(MRI), 5 neuroscientists have pieced together a fairly high-resolution map of regions 
that are specialized to process particular visual stimuli, such as color, motion, and faces 
(see Figure 1.7). 


5. Especially functional MRI, which provides insight into which regions of the cerebral cortex are notably active
or inactive when the brain is engaged in a particular activity. 

Figure 1.7 Regions of the visual cortex. The V1 region receives input from the eyes and
contains the simple cells that detect edge orientations. Through the recombination of
information via myriad subsequent layers of neurons (including within the V2, V3, and
V3a regions), increasingly abstract visual stimuli are represented. In the human brain
(shown here), there are regions containing neurons with concentrations of specializations
in, for example, the detection of color (V4), motion (V5), and people’s faces (fusiform
face area).


Machine Vision 


We haven’t been discussing the biological visual system solely because it’s interesting 
(though hopefully you did find the preceding section thoroughly interesting). We have 
covered the biological visual system primarily because it serves as the inspiration for the 
modern deep learning approaches to machine vision, as will become clear in this section.

Figure 1.8 provides a concise historical timeline of vision in biological organisms as 
well as machines. The top timeline, in blue, highlights the development of vision in 
trilobites as well as Hubel and Wiesel’s 1959 publication on the hierarchical nature of the
primary visual cortex, as covered in the preceding section. The machine vision timeline is 
split into two parallel streams to call attention to two alternative approaches. The middle 
timeline, in pink, represents the deep learning track that is the focus of our book. The 
bottom timeline, in purple, meanwhile represents the traditional machine learning (ML) 
path to vision, which—through contrast—will clarify why deep learning is distinctively 
powerful and revolutionary. 


The Neocognitron 

Inspired by Hubel and Wiesel’s discovery of the simple and complex cells that form
the primary visual cortex hierarchy, in the late 1970s the Japanese electrical engineer 
Figure 1.8 Abridged timeline of biological and machine vision, highlighting the key
historical moments in the deep learning and traditional machine learning approaches to 
vision that are covered in this section 


Kunihiko Fukushima proposed an analogous architecture for machine vision, which he 
named the neocognitron. 6 There are two particular items to note:

1. Fukushima referred to Hubel and Wiesel’s work explicitly in his writing. Indeed,
his paper refers to three of their landmark articles on the organization of the pri- 
mary visual cortex, including borrowing their “simple” and “complex” cell lan- 
guage to describe the first and second layers, respectively, of his neocognitron. 

2. By arranging artificial neurons 7 in this hierarchical manner, these neurons—like
their biological inspiration in Figure 1.6—generally represent line orientations in
the cells of the layers closest to the raw visual image, while successively deeper
layers represent successively complex, successively abstract objects. To clarify this 
potent property of the neocognitron and its deep learning descendants, we go 
through an interactive example at the end of this chapter that demonstrates it. 8 

LeNet-5 

While the neocognitron was capable of, for example, identifying handwritten
characters, 9 the accuracy and efficiency of Yann LeCun (Figure 1.9) and Yoshua Bengio’s
(Figure 1.10) LeNet-5 model 10 made it a significant development. LeNet-5’s hierarchical


6. Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern 
recognition unaffected by shift in position. Biological Cybernetics, 36, 193-202.

7. We define precisely what artificial neurons are in Chapter 7. For the moment, it’s more than sufficient to think
of each artificial neuron as a speedy little algorithm. 

8. Specifically, Figure 1.19 demonstrates this hierarchy with its successively abstract representations. 

9. Fukushima, K., & Wake, N. (1991). Handwritten alphanumeric character recognition by the neocognitron. 
IEEE Transactions on Neural Networks, 2, 355-65.

10. LeCun, Y., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86,
2278-324.
Figure 1.9 Paris-born Yann LeCun is one of the preeminent figures in artificial neural
network and deep learning research. LeCun is the founding director of the New York
University Center for Data Science as well as the director of AI research at the social
network Facebook.



Figure 1.10 Yoshua Bengio is another of the leading characters in artificial neural 
networks and deep learning. Born in France, he is a computer science professor at the
University of Montreal and codirects the renowned Machines and Brains program at the 
Canadian Institute for Advanced Research. 

architecture (Figure 1.11) built on Fukushima’s lead and the biological inspiration
uncovered by Hubel and Wiesel. 11 In addition, LeCun and his colleagues benefited from
superior data for training their model, 12 faster processing power, and, critically, the back- 
propagation algorithm. 

Backpropagation, often abbreviated to backprop, facilitates efficient learning
throughout the layers of artificial neurons within a deep learning model. 13 Together with the
researchers’ data and processing power, backprop rendered LeNet-5 sufficiently reliable 


11. LeNet-5 was the first convolutional neural network, a deep learning variant that dominates modern machine vision 
and that we detail in Chapter 10. 

12. Their classic dataset, the handwritten MNIST digits, is used extensively in Part II, “Essential Theory 
Illustrated.” 

13. We examine the backpropagation algorithm in Chapter 7. 
Figure 1.11 LeNet-5 retains the hierarchical architecture uncovered in the primary
visual cortex by Hubel and Wiesel and leveraged by Fukushima in his neocognitron. As in
those other systems, the leftmost layer represents simple edges, while successive layers
represent increasingly complex features. By processing information in this way, a
handwritten “2” should, for example, be correctly recognized as the number two
(highlighted by the green output shown on the right).


to become an early commercial application of deep learning: It was used by the United 
States Postal Service to automate the reading of ZIP codes 14 written on mail envelopes. 

In Chapter 10, on machine vision, you will experience LeNet-5 firsthand by designing it 
yourself and training it to recognize handwritten digits. 

In LeNet-5, Yann LeCun and his colleagues had an algorithm that could correctly 
predict the handwritten digits that had been drawn without needing to include any ex- 
pertise about handwritten digits in their code. As such, LeNet-5 provides an opportunity 
to introduce a fundamental difference between deep learning and the traditional machine 
learning ideology. As conveyed by Figure 1.12, the traditional machine learning approach 
is characterized by practitioners investing the bulk of their efforts into engineering
features. This feature engineering is the application of clever, and often elaborate, algorithms
to raw data in order to preprocess the data into input variables that can be readily
modeled by traditional statistical techniques. These techniques—such as regression, random
forest, and support vector machine—are seldom effective on unprocessed data, and so 
the engineering of input data has historically been a prime focus of machine learning 
professionals. 

In general, a minority of the traditional ML practitioner’s time is spent optimizing
ML models or selecting the most effective one from those available. The deep learning
approach to modeling data turns these priorities upside down. The deep learning practitioner


14. The USPS term for postal code. 
Figure 1.12 Feature engineering—the transformation of raw data into thoughtfully
transformed input variables—often predominates the application of traditional machine
learning algorithms. In contrast, the application of deep learning often involves little to no
feature engineering, with the majority of time spent instead on the design and tuning of
model architectures.


typically spends little to none of her time engineering features, instead spending it modeling data
with various artificial neural network architectures that process the raw inputs into useful features
automatically. This distinction between deep learning and traditional machine learning is a
core theme of this book. The next section provides a classic example of feature
engineering to elucidate the distinction.

The Traditional Machine Learning Approach 

Following LeNet-5, research into artificial neural networks, including deep learning, fell
out of favor. The consensus became that the approach’s automated feature generation
was not pragmatic—that even though it worked well for handwritten character
recognition, the feature-free ideology was perceived to have limited breadth of applicability. 15
Traditional machine learning, including its feature engineering, appeared to hold more 
promise, and funding shifted away from deep learning research. 16 

To make clear what feature engineering is, Figure 1.13 provides a celebrated example
from Paul Viola and Michael Jones in the early 2000s. 17 Viola and Jones employed
rectangular filters such as the vertical or horizontal black-and-white bars shown in the figure.
Features generated by passing these filters over an image can be fed into machine
learning algorithms to reliably detect the presence of a face. This work is notable because the


15. At the time, there were stumbling blocks associated with optimizing deep learning models that have since been 
resolved, including poor weight initializations (covered in Chapter 9), covariate shift (also in Chapter 9), and the 
predominance of the relatively inefficient sigmoid activation function (Chapter 6).

16. Public funding for artificial neural network research ebbed globally, with the notable exception of continued 
support from the Canadian federal government, enabling the Universities of Montreal, Toronto, and Alberta to 
become powerhouses in the field. 

17. Viola, P., & Jones, M. (2001). Robust real-time face detection. International Journal of Computer Vision, 57,
137-54.

Figure 1.13 Engineered features leveraged by Viola and Jones (2001) to detect faces
reliably. Their efficient algorithm found its way into Fujifilm cameras, facilitating
real-time auto-focus.

algorithm was efficient enough to be the first real-time face detector outside the realm of 
biology. 18 
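
To make the idea of a hand-engineered rectangular filter concrete, here is a minimal sketch in plain Python. It is an illustration only and not Viola and Jones's implementation: the tiny 4x4 image, the function names (`integral_image`, `rect_sum`, `two_rect_feature`), and the filter placement are all assumptions made for this example. It evaluates one two-rectangle feature (the pixel sum under the left half of a window minus the sum under the right half) using a summed-area table, which keeps each rectangle sum to four lookups.

```python
def integral_image(img):
    """Build a summed-area table with an extra leading row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_total = 0
        for c in range(w):
            row_total += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_total
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Left half minus right half of a window (width is assumed to be even)."""
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum


# Hypothetical image: bright on the left, dark on the right (a vertical edge).
image = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [9, 9, 1, 1],
]
ii = integral_image(image)
edge_response = two_rect_feature(ii, 0, 0, 4, 4)  # window straddles the edge
flat_response = two_rect_feature(ii, 0, 0, 4, 2)  # window lies in a uniform region
```

The window that straddles the edge gives a large response, while the window over a uniform region gives zero. A traditional pipeline would compute many such hand-chosen responses and pass them to a classifier.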

Devising clever face-detecting filters to process raw pixels into features for input into
a machine learning model was accomplished via years of research and collaboration on
the characteristics of faces. And, of course, it is limited to detecting faces in general, as
opposed to being able to recognize a particular face as, say, Angela Merkel’s or Oprah
Winfrey’s. To develop features for detecting Oprah in particular, or for detecting some
non-face class of objects like houses, cars, or Yorkshire Terriers, would require the development
of expertise in that category, something that could again take years of academic-community
collaboration to execute both efficiently and accurately. Hmm, if only we
could circumnavigate all that time and effort somehow!

ImageNet and the ILSVRC 

As mentioned earlier, one of the advantages LeNet-5 had over the neocognitron was a 
larger, high-quality set of training data. The next breakthrough in neural networks was 
also facilitated by a high-quality public dataset, this time much larger. ImageNet, a labeled
index of photographs devised by Fei-Fei Li (Figure 1.14), armed machine vision
researchers with an immense catalog of training data. 19, 20 For reference, the handwritten
digit data used to train LeNet-5 contained tens of thousands of images. ImageNet, in
contrast, contains tens of millions.

The 14 million images in the ImageNet dataset are spread across 22,000 categories.
These categories are as diverse as container ships, leopards, starfish, and elderberries. Since
2010, Li has run an open challenge called ILSVRC (the ImageNet Large Scale Visual 
Recognition Challenge) on a subset of the ImageNet data that has become the premier 


18. A few years later, the algorithm found its way into digital Fujifilm cameras, facilitating autofocus on faces for 
the first time—a now everyday attribute of digital cameras and smartphones alike. 

19. image-net.org 

20. Deng, J., et al. (2009). ImageNet: A large-scale hierarchical image database. Proceedings of the Conference on
Computer Vision and Pattern Recognition.

Figure 1.14 The hulking ImageNet dataset was the brainchild of Chinese-American
computer science professor Fei-Fei Li and her colleagues at Princeton in 2009. Now a
faculty member at Stanford University, Li is also the chief scientist of A.I./ML for Google’s
cloud platform.


ground for assessing the world’s state-of-the-art machine vision algorithms. The ILSVRC
subset consists of 1.4 million images across 1,000 categories. In addition to providing a 
broad range of categories, many of the selected categories are breeds of dogs, thereby 
evaluating the algorithms’ ability not only to distinguish widely varying images but also to 
specialize in distinguishing subtly varying ones. 21 

AlexNet 

As graphed in Figure 1.15, in the first two years of the ILSVRC all algorithms entered
into the competition hailed from the feature-engineering-driven traditional machine
learning ideology. In the third year, all entrants except one were traditional ML algorithms.
If that one deep learning model in 2012 had not been developed or if its creators had
not competed in ILSVRC, then the year-over-year improvement in image classification
accuracy would have been negligible. Instead, Alex Krizhevsky and Ilya Sutskever, working
out of the University of Toronto lab led by Geoffrey Hinton (Figure 1.16), crushed the existing
benchmarks with their submission, today referred to as AlexNet (Figure 1.17). 22, 23 This
was a watershed moment. In an instant, deep learning architectures emerged from the
fringes of machine learning to its fore. Academics and commercial practitioners scrambled
to grasp the fundamentals of artificial neural networks as well as to create software
libraries (many of them open-source) to experiment with deep learning models on their
own data and use cases, be they machine vision or otherwise. As Figure 1.15 illustrates, in


21. On your own time, try to distinguish photos of Yorkshire Terriers from Australian Silky Terriers. It’s tough,
but Westminster Dog Show judges, as well as contemporary machine vision models, can do it. Tangentially, these 
dog-heavy data are the reason deep learning models trained with ImageNet have a disposition toward “dreaming” 
about dogs (see, e.g., deepdreamgenerator.com). 

22. Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). ImageNet classification with deep convolutional neural 
networks. Advances in Neural Information Processing Systems, 25. 

23. The images along the bottom of Figure 1.17 were obtained from Yosinski, J., et al. (2015). Understanding
neural networks through deep visualization. arXiv:1506.06579.

Figure 1.15 Performance of the top entrants to the ILSVRC by year. AlexNet was the
victor by a head-and-shoulders (40 percent!) margin in the 2012 iteration. All of the best
algorithms since then have been deep learning models. In 2015, machines surpassed
human accuracy.



Figure 1.16 The eminent British-Canadian artificial neural network pioneer Geoffrey 
Hinton, habitually referred to as “the godfather of deep learning” in the popular press. 
Hinton is an emeritus professor at the University of Toronto and an engineering fellow at 
Google, responsible for managing the search giant’s Brain Team, a research arm, in 
Toronto. In 2019, Hinton, Yann LeCun (Figure 1.9), and Yoshua Bengio (Figure 1.10) were
jointly recognized with the Turing Award, the highest honor in computer science, for
their work on deep learning.

Figure 1.17 AlexNet’s hierarchical architecture is reminiscent of LeNet-5, with the first
(left-hand) layer representing simple visual features like edges, and deeper layers
representing increasingly complex features and abstract concepts. Shown at the bottom
are examples of images to which the neurons in that layer maximally respond, recalling
the layers of the biological visual system in Figure 1.6 and demonstrating the hierarchical
increase in visual feature complexity. In the example shown here, an image of a cat input
into AlexNet is correctly identified as such (as implied by the green “CAT” output).
“CONV” indicates the use of something called a convolutional layer, and “FC” is a fully
connected layer; we formally introduce these layer types in Chapters 7 and 10,
respectively.


the years since 2012 all of the top-performing models in the ILSVRC have been based on
deep learning. 

Although the hierarchical architecture of AlexNet is reminiscent of LeNet-5, there 
are three principal factors that enabled AlexNet to be the state-of-the-art machine vision 
algorithm in 2012. First is the training data. Not only did Krizhevsky and his colleagues 
have access to the massive ImageNet index, they also artificially expanded the data avail- 
able to them by applying transformations to the training images (you, too, will do this in 
Chapter 10). Second is processing power. Not only had computing power per unit of cost 
increased dramatically from 1998 to 2012, but Krizhevsky, Hinton, and Sutskever also 
programmed two GPUs 24 to train their large datasets with previously unseen efficiency. 
Third is architectural advances. AlexNet is deeper (has more layers) than LeNet-5, and
it takes advantage of both a new type of artificial neuron 25 and a nifty trick 26 that helps
generalize deep learning models beyond the data they’re trained on. As with LeNet-5,
you will build AlexNet yourself in Chapter 10 and use it to classify images.


24. Graphical processing units: These are designed primarily for rendering video games but are well suited to
performing the matrix multiplication that abounds in deep learning across hundreds of parallel computing threads.
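
The first of the three factors above, artificially expanding the training data by transforming images, can be sketched in a few lines of plain Python. This is a hedged illustration rather than the AlexNet pipeline: the text does not say which transformations were applied, so the horizontal flip and sliding crop used here, along with the helper names (`horizontal_flip`, `crop`, `augment`) and the 3x3 toy image, are assumptions chosen for the example.

```python
def horizontal_flip(img):
    """Mirror an image (a list of pixel rows) left to right."""
    return [list(reversed(row)) for row in img]


def crop(img, top, left, size):
    """Cut a size x size patch whose top-left corner is at (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]


def augment(img, crop_size):
    """Return every crop_size x crop_size patch together with its mirror image."""
    h, w = len(img), len(img[0])
    variants = []
    for top in range(h - crop_size + 1):
        for left in range(w - crop_size + 1):
            patch = crop(img, top, left, crop_size)
            variants.append(patch)
            variants.append(horizontal_flip(patch))
    return variants


# Hypothetical 3x3 grayscale image standing in for a training photograph.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
augmented = augment(image, crop_size=2)
```

One original image yields eight training variants here (four crops, each with a mirrored copy), and every variant keeps the label of the original.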

Our ILSVRC case study underlines why deep learning models like AlexNet are so 
widely useful and disruptive across industries and computational applications: They 
dramatically reduce the subject-matter expertise required for building highly accurate 
predictive models. This trend away from expertise-driven feature engineering and toward 
surprisingly powerful automatic-feature-generating deep learning models has been preva- 
lently borne out across not only vision applications, but also, for example, the playing of 
complex games (the topic of Chapter 4) and natural language processing (Chapter 2). 27 
One no longer needs to be a specialist in the visual attributes of faces to create a face- 
recognition algorithm. One no longer requires a thorough understanding of a game’s
strategies to write a program that can master it. One no longer needs to be an authority
on the structure and semantics of each of several languages to develop a language-
translation tool. For a rapidly growing list of use cases, one’s ability to apply deep learning
techniques outweighs the value of domain-specific proficiency. While such proficiency 
formerly may have necessitated a doctoral degree or perhaps years of postdoctoral research 
within a given domain, a functional level of deep learning capability can be developed 
with relative ease—as by working through this book! 

TensorFlow Playground 

For a fun, interactive way to crystallize the hierarchical, feature-learning nature of deep 
learning, make your way to the TensorFlow Playground at bit.ly/TFplayground. When
you use this custom link, your network should automatically look similar to the one
shown in Figure 1.18. In Part II we return to define all of the terms on the screen; for the
present exercise, they can be safely ignored. It suffices at this time to know that this is a
deep learning model. The model architecture consists of six layers of artificial neurons: an
input layer on the left (below the “FEATURES” heading), four “HIDDEN LAYERS”
(which bear the responsibility of learning), and an “OUTPUT” layer (the grid on the
far right ranging from -6 to +6 on both axes). The network’s goal is to learn how to
distinguish orange dots (negative cases) from blue dots (positive cases) based solely on
their location on the grid. As such, in the input layer, we are only feeding in two pieces
of information about each dot: its horizontal position (X1) and its vertical position (X2).
The dots that will be used as training data are shown by default on the grid. By clicking 
the Show test data toggle, you can also see the location of dots that will be used to assess 
the performance of the network as it learns. Critically, these test data are not available to 
the network while it’s learning, so they help us ensure that the network generalizes well 
to new, unseen data. 
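
For readers who want to see what such inputs might look like as data, the following is a rough sketch of two interleaved spirals with a held-out test set. It is not the Playground's own code: the spiral formula, the number of dots, the noise level, the 80/20 split, and the 0/1 label convention (0 for orange, 1 for blue) are all assumptions made for illustration.

```python
import math
import random


def make_spiral(n_points, label, phase, rng, noise=0.1):
    """Generate n_points dots along one spiral arm, each tagged with a label."""
    points = []
    for i in range(n_points):
        radius = 5.0 * i / n_points                      # grows from 0 toward 5
        angle = 1.75 * 2 * math.pi * i / n_points + phase
        x1 = radius * math.sin(angle) + rng.uniform(-noise, noise)  # horizontal position
        x2 = radius * math.cos(angle) + rng.uniform(-noise, noise)  # vertical position
        points.append(((x1, x2), label))
    return points


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Label 0 stands for orange (negative) dots, label 1 for blue (positive) dots.
data = make_spiral(100, 0, 0.0, rng) + make_spiral(100, 1, math.pi, rng)
rng.shuffle(data)

# Hold out 20 percent of the dots; the network would never see these while learning.
split = len(data) * 8 // 10
train, test = data[:split], data[split:]
```

Each example carries only the two coordinates and a label, mirroring the two input features described above, and the test dots are kept aside purely to check how well a trained network generalizes.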


25. The rectified linear unit (ReLU), which is introduced in Chapter 6. 

26. Dropout, introduced in Chapter 9. 

27. An especially entertaining recounting of the disruption to the field of machine translation is provided by Gideon
Lewis-Kraus in his article “The Great A.I. Awakening,” published in the New York Times Magazine on December
14, 2016.

Figure 1.18 This deep neural network is ready to learn how to distinguish a spiral of
orange dots (negative cases) from blue dots (positive cases) based on their position on
the X1 and X2 axes of the grid on the right.


Click the prominent Play arrow in the top-left corner. Enable the network to train 
until the “Training loss” and “Test loss” in the top-right corner have both approached 
zero (say, less than 0.05). How long this takes will depend on the hardware you’re using
but hopefully will not be more than a few minutes. 

As captured in Figure 1.19, you should now see the network’s artificial neurons representing
the input data, with increasing complexity and abstraction the deeper (further
to the right) they are positioned, as in the neocognitron, LeNet-5 (Figure 1.11), and
AlexNet (Figure 1.17). Every time the network is run, the neuron-level details of how
the network solves the spiral classification problem are unique, but the general approach
remains the same (to see this for yourself, you can refresh the page and retrain the network).
The artificial neurons in the leftmost hidden layer are specialized in distinguishing
edges (straight lines), each at a particular orientation. Neurons from the first hidden layer
pass information to neurons in the second hidden layer, each of which recombines the
edges into slightly more complex features like curves. The neurons in each successive
layer recombine information from the neurons of the preceding layer, gradually increasing
the complexity and abstraction of the features the neurons can represent. By the final
(rightmost) layer, the neurons are adept at representing the intricacies of the spiral shape,
enabling the network to accurately predict whether a dot is orange (a negative case) or
blue (a positive case) based on its position (its X1 and X2 coordinates) in the grid. Hover
over a neuron to project it onto the far-right “OUTPUT” grid and examine its individual
specialization in detail.

Figure 1.19 The network after training


Quick, Draw! 

To interactively experience a deep learning network carrying out a machine vision task 
in real time, navigate to quickdraw.withgoogle.com to play the Quick, Draw! game.
Click Let’s Draw! to begin playing the game. You will be prompted to draw an object, and
a deep learning algorithm will guess what you sketch. By the end of Chapter 10, we will
have covered all of the theory and practical code examples needed to devise a machine
vision algorithm akin to this one. To boot, the drawings you create will be added to the
dataset that you’ll leverage in Chapter 12 when you create a deep learning model that can
convincingly mimic human-drawn doodles. Hold on to your seat! We’re embarking on a 
fantastic ride. 

Summary 

In this chapter, we traced the history of deep learning from its biological inspiration 
through to the AlexNet triumph in 2012 that brought the technique to the fore. All the 
while, we reiterated that the hierarchical architecture of deep learning models enables 
them to encode increasingly complex representations. To concretize this concept, we 
concluded with an interactive demonstration of hierarchical representations in action by 
training an artificial neural network in the TensorFlow Playground. In Chapter 2, we will 
expand on the ideas introduced in this chapter by moving from vision applications to 
language applications.

Human and Machine Language


In Chapter 1, we introduced the high-level theory of deep learning via analogy to the
biological visual system. All the while, we highlighted that one of the technique’s core
strengths lies in its ability to learn features automatically from data. In this chapter, we
build atop our deep learning foundations by examining how deep learning is incorporated
into human language applications, with a particular emphasis on how it can automatically
learn features that represent the meaning of words.

The Austro-British philosopher Ludwig Wittgenstein famously argued, in his posthu- 
mous and seminal work Philosophical Investigations, “The meaning of a word is its use in 
the language.” 1 He further wrote, “One cannot guess how a word functions. One has to 
look at its use, and learn from that.” Wittgenstein was suggesting that words on their own 
have no real meaning; rather, it is by their use within the larger context of that language 
that we’re able to ascertain their meaning. As you’ll see through this chapter, natural language
processing with deep learning relies heavily on this premise. Indeed, the word2vec
technique we introduce for converting words into numeric model inputs explicitly derives
its semantic representation of a word by analyzing it within its contexts across a large
body of language. 

Armed with this notion, we begin by breaking down deep learning for natural language
processing (NLP) as a discipline, and then we go on to discuss modern deep learning
techniques for representing words and language. By the end of the chapter, you
should have a good grasp on what is possible with deep learning and NLP, the groundwork
for writing such code in Chapter 11.

Deep Learning for Natural Language 
Processing 

The two core concepts in this chapter are deep learning and natural language processing. Ini- 
tially, we cover the relevant aspects of these concepts separately, and then we weave them 
together as the chapter progresses. 


1. Wittgenstein, L. (1953). Philosophical Investigations. (Anscombe, G., Trans.). Oxford, UK: Basil Blackwell.

Deep Learning Networks Learn Representations
Automatically

As established way back in this book’s Preface, deep learning can be defined as the layering
of simple algorithms called artificial neurons into networks several layers deep. Via
the Venn diagram in Figure 2.1, we show how deep learning resides within the machine 
learning family of representation learning approaches. The representation learning family, 
which contemporary deep learning dominates, includes any techniques that learn features 
from data automatically. Indeed, we can use the terms “feature” and “representation” 
interchangeably. 

Figure 1.12 lays the foundation for understanding the advantage of representation
learning relative to traditional machine learning approaches. Traditional ML typically
works well because of clever, human-designed code that transforms raw data (whether
it be images, audio of speech, or text from documents) into input features for machine
learning algorithms (e.g., regression, random forest, or support vector machines) that are
adept at weighting features but not particularly good at learning features from raw data
directly. This manual creation of features is often a highly specialized task. For working
with language data, for example, it might require graduate-level training in linguistics.

A primary benefit of deep learning is that it eases this requirement for subject-matter
expertise. Instead of manually curating input features from raw data, one can feed the
data directly into a deep learning model. Over the course of many examples provided
to the deep learning model, the artificial neurons of the first layer of the network learn
how to represent simple abstractions of these data, while each successive layer learns to
represent increasingly complex nonlinear abstractions on the layer that precedes it. As
you’ll discover in this chapter, this isn’t solely a matter of convenience; learning features
automatically has additional advantages. Features engineered by humans tend to not be
comprehensive, tend to be excessively specific, and can involve lengthy, ongoing loops of
feature ideation, design, and validation that could stretch for years. Representation learning
models, meanwhile, generate features quickly (typically over hours or days of model
training), adapt straightforwardly to changes in the data (e.g., new words, meanings, or
ways of using language), and adapt automatically to shifts in the problem being solved.

Figure 2.1 Venn diagram that distinguishes the traditional family from the
representation learning family of machine learning techniques

Natural Language Processing 

Natural language processing is a field of research that sits at the intersection of computer
science, linguistics, and artificial intelligence (Figure 2.2). NLP involves taking the
naturally spoken or naturally written language of humans (such as this sentence you’re
reading right now) and processing it with machines to automatically complete some
task or to make a task easier for a human to do. Examples of language use that do not fall
under the umbrella of natural language could include code written in a software language
or short strings of characters within a spreadsheet.

Examples of NLP in industry include: 

■ Classifying documents: using the language within a document (e.g., an email, a Tweet,
or a review of a film) to classify it into a particular category (e.g., high urgency,
positive sentiment, or predicted direction of the price of a company’s stock).

■ Machine translation: assisting language-translation firms with machine-generated suggestions
from a source language (e.g., English) to a target language (e.g., German or
Mandarin); increasingly, fully automatic (though not always perfect) translations
between languages.

■ Search engines: autocompleting users’ searches and predicting what information or
website they’re seeking.

■ Speech recognition: interpreting voice commands to provide information or take
action, as with virtual assistants like Amazon’s Alexa, Apple’s Siri, or Microsoft’s
Cortana.

■ Chatbots: carrying out a natural conversation for an extended period of time;
though this is seldom done convincingly today, they are nevertheless helpful for
relatively linear conversations on narrow topics such as the routine components of a
firm’s customer-service phone calls.

Figure 2.2 NLP sits at the intersection of the fields of computer science, linguistics,
and artificial intelligence.

Some of the easiest NLP applications to build are spell checkers, synonym suggesters,
and keyword-search querying tools. These simple tasks can be fairly straightforwardly
solved with deterministic, rules-based code using, say, reference dictionaries or thesauruses.
Deep learning models are unnecessarily sophisticated for these applications, and
so they aren’t discussed further in this book. 

Intermediate-complexity NLP tasks include assigning a school-grade reading level 
to a document, predicting the most likely next words while making a query in a search 
engine, classifying documents (see earlier list), and extracting information like prices or 
named entities 2 from documents or websites. These intermediate NLP applications are 
well suited to solving with deep learning models. In Chapter 11, for example, you’ll 
leverage a variety of deep learning architectures to predict the sentiment of film reviews. 

The most sophisticated NLP implementations are required for machine translation (see 
earlier list), automated question-answering, and chatbots. These are tricky because they 
need to handle application-critical nuance (as an example, humor is particularly transient), 
a response to a question can depend on the intermediate responses to previous questions, 
and meaning can be conveyed over the course of a lengthy passage of text consisting 
of many sentences. Complex NLP tasks like these are beyond the scope of this book; 
however, the content we cover will serve as a superb foundation for their development. 

A Brief History of Deep Learning for NLP 

The timeline in Figure 2.3 calls out recent milestones in the application of deep learning
to NLP. This timeline begins in 2011, when the University of Toronto computer
scientist George Dahl and his colleagues at Microsoft Research revealed the first major
breakthrough involving a deep learning algorithm applied to a large dataset. 3 This breakthrough
happened to involve natural language data. Dahl and his team trained a deep
neural network to recognize a substantial vocabulary of words from audio recordings of
human speech. A year later, and as detailed already in Chapter 1, the next landmark deep
learning feat also came out of Toronto: AlexNet blowing the traditional machine learning
competition out of the water in the ImageNet Large Scale Visual Recognition Challenge
(Figure 1.15). For a time, this staggering machine vision performance heralded a focus on
applying deep learning to machine vision applications.

Figure 2.3 Milestones involving the application of deep learning to natural language
processing (timeline running from 2011 to 2017)


2. Named entities include places, well-known individuals, company names, and products.

By 2015, the deep learning progress being made in machine vision began to spill over 
into NLP competitions such as those that assess the accuracy of machine translations 
from one language into another. These deep learning models approached the preci- 
sion of traditional machine learning approaches; however, they required less research 
and development time while conveniently offering lower computational complexity. 
Indeed, this reduction in computational complexity provided Microsoft the opportunity
to squeeze real-time machine translation software onto mobile phone processors,
remarkable progress for a task that previously had required an Internet connection and
computationally expensive calculations on a remote server. In 2016 and 2017, deep learning
models entered into NLP competitions not only were more efficient than traditional
machine learning models, but they also began outperforming them on accuracy. The 
remainder of this chapter starts to illuminate how. 

Computational Representations of Language 

In order for deep learning models to process language, we have to supply that language 
to the model in a way that it can digest. For all computer systems, this means a quantitative
representation of language, such as a two-dimensional matrix of numerical values.
Two popular methods for converting text into numbers are one-hot encoding and word 
vectors. 4 We discuss both methods in turn in this section. 

One-Hot Representations of Words 

The traditional approach to encoding natural language numerically for processing it with
a machine is one-hot encoding (Figure 2.4). In this approach, the words of natural language
in a sentence (e.g., “the,” “bat,” “sat,” “on,” “the,” and “cat”) are represented by the
columns of a matrix. Each row in the matrix, meanwhile, represents a unique word. If
there are 100 unique words across the corpus 5 of documents you’re feeding into your
natural language algorithm, then your matrix of one-hot-encoded words will have 100
rows. If there are 1,000 unique words across your corpus, then there will be 1,000 rows in
your one-hot matrix, and so on.

Sentence: The bat sat on the cat.
(columns: words, in order; rows: n unique_words; only the rows recoverable from the figure are shown)

        The  bat  sat  on  the  cat
the      1    0    0   0    1    0
bat      0    1    0   0    0    0
on       0    0    0   1    0    0

Figure 2.4 One-hot encodings of words, such as this example, predominate the
traditional machine learning approach to natural language processing.


3. Dahl, G., et al. (2011). Large vocabulary continuous speech recognition with context-dependent DBN-HMMs.
Proceedings of the International Conference on Acoustics, Speech, and Signal Processing.

4. If this were a book dedicated to NLP, then we would have been wise to also describe natural language methods
based on word frequency, e.g., TF-IDF (term frequency-inverse document frequency) and PMI (pointwise mutual
information).

5. A corpus (from the Latin “body”) is the collection of all of the documents (the “body” of language) you use as
your input data for a given natural language application. In Chapter 11, you’ll make use of a corpus that consists of
18 classic books. Later in that chapter, you’ll separately make use of a corpus of 25,000 film reviews. An example
of a much larger corpus would be all of the English-language articles on Wikipedia. The largest corpuses are crawls
of all the publicly available data on the Internet, such as at commoncrawl.org.

Cells within one-hot matrices consist of binary values, that is, they are a 0 or a 1. Each
column contains at most a single 1, but is otherwise made up of 0s, meaning that one-hot
matrices are sparse. 6 Values of one indicate the presence of a particular word (row) at a
particular position (column) within the corpus. In Figure 2.4, our entire corpus has only
six words, five of which are unique. Given this, a one-hot representation of the words in
our corpus has six columns and five rows. The first unique word, the, occurs in the
first and fifth positions, as indicated by the cells containing 1s in the first row of the matrix.
The second unique word in our wee corpus is bat, which occurs only in the second
position, so it is represented by a value of 1 in the second row of the second column.
One-hot word representations like this are fairly straightforward, and they are an acceptable
format for feeding into a deep learning model (or, indeed, other machine learning
models). As you will see momentarily, however, the simplicity and sparsity of one-hot
representations are limiting when incorporated into a natural language application.
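
The row-and-column layout described above can be reproduced with a short sketch in plain Python. The function name `one_hot_matrix` and the choice to lowercase the sentence and split it on spaces are assumptions made for this illustration; a real pipeline would typically use a proper tokenizer.

```python
def one_hot_matrix(tokens):
    """Return (vocab, matrix): one row per unique word, one column per position."""
    vocab = []
    for tok in tokens:
        if tok not in vocab:      # keep unique words in order of first appearance
            vocab.append(tok)
    matrix = [[1 if tok == word else 0 for tok in tokens] for word in vocab]
    return vocab, matrix


# The six-word toy corpus from Figure 2.4, lowercased and split on spaces.
tokens = "the bat sat on the cat".split()
vocab, matrix = one_hot_matrix(tokens)

for word, row in zip(vocab, matrix):
    print(f"{word:>4} {row}")
```

With this corpus the matrix has five rows (unique words) and six columns (positions), and every column contains exactly one 1, which is what makes the representation sparse.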

Word Vectors 

Vector representations of words are the information-dense alternative to one-hot encodings
of words. Whereas one-hot representations capture information about word location
only, word vectors (also known as word embeddings or vector-space embeddings) capture information
about word meaning as well as location. 7 This additional information renders
word vectors favorable for a variety of reasons that are catalogued over the course of this
chapter. The key advantage, however, is that, analogous to the visual features learned
automatically by deep learning machine vision models in Chapter 1, word vectors enable
deep learning NLP models to automatically learn linguistic features.


6. Nonzero values are rare (i.e., they are sparse) within a sparse matrix. In contrast, dense matrices are rich in
information: They typically contain few (perhaps even no) zero values.

7. Strictly speaking, a one-hot representation is technically a “word vector” itself, because each column in a one-hot
word matrix consists of a vector representing a word at a given location. In the deep learning community, however,
use of the term “word vector” is commonly reserved for the dense representations covered in this section, that
is, those derived by word2vec, GloVe, and related techniques.

When we’re creating word vectors, the overarching concept is that we’d like to assign 
each word within a corpus to a particular, meaningful location within a multidimensional 
space called the vector space. Initially, each word is assigned to a random location within 
the vector space. By considering the words that tend to be used around a given word 
within the natural language of your corpus, however, the locations of the words within 
the vector space can gradually be shifted into locations that represent the meaning of the 
words. 8 

Figure 2.5 uses a toy-sized example to demonstrate in more detail the mechanics
behind the way word vectors are constructed. Commencing at the first word in our corpus
and moving to the right one word at a time until we reach the final word in our
corpus, we consider each word to be the target word. At the particular moment captured
in Figure 2.5, the target word that happens to be under consideration is word. The next
target word would be by, followed by the, then company, and so on. For each target
word in turn, we consider it relative to the words around it: its context words. In our
toy example, we’re using a context-word window size of three words. This means that
while word is the target word, the three words to the left (a, know, and shall) combined
with the three words to the right (by, company, and the) together constitute a total of six
context words. 9 When we move along to the subsequent target word (by), the windows
of context words also shift one position to the right, dropping shall and by as context
words while adding word and it.
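
The sliding of target and context windows described above can be sketched in a few lines of Python. This is a minimal illustration rather than the internals of word2vec or GloVe: the function name context_windows and its window_size parameter are our own, and the context words are sorted alphabetically to mirror the convention that their order is ignored.

```python
def context_windows(tokens, window_size=3):
    """Yield (target, context_words) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        # Word order inside the window is ignored, so sort alphabetically
        yield target, sorted(left + right)


corpus = "you shall know a word by the company it keeps".split()
pairs = list(context_windows(corpus))

for target, context in pairs:
    print(f"{target:>8}: {context}")
```

With the ten-word toy corpus, the entry for word lists a, by, company, know, shall, and the, and the entry for by swaps shall and by for word and it. Targets near either end of the corpus simply receive fewer context words.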

Figure 2.5 A toy-sized example for demonstrating the high-level process behind
techniques like word2vec and GloVe that convert natural language into word vectors



8. As mentioned at the beginning of this chapter, this understanding of the meaning of a word from the words
around it was proposed by Ludwig Wittgenstein. Later, in 1957, the idea was captured succinctly by the British 
linguist J.R. Firth with his phrase, “You shall know a word by the company it keeps.” Firth, J. (1957). Studies in 
linguistic analysis. Oxford: Blackwell. 

9. It is mathematically simpler and more efficient to not concern ourselves with the specific ordering of context 
words, particularly because word order tends to confer negligible extra information to the inference of word 
vectors. Ergo, we provide the context words in parentheses alphabetically, an effectively random order.

Two of the most popular techniques for converting natural language into word vectors
are word2vec 10 and GloVe. 11 With either technique, our objective while considering
any given target word is to accurately predict the target word given its context words. 12
Improving at these predictions, target word after target word over a large corpus, we gradually
assign words that tend to appear in similar contexts to similar locations in vector
space.

Figure 2.6 provides a cartoon of vector space. The space can have any number of
dimensions, so we can call it an n-dimensional vector space. In practice, depending
on the richness of the corpus we have to work with and the complexity of our NLP
application, we might create a word-vector space with dozens, hundreds, or (in extreme
cases) thousands of dimensions. As overviewed in the previous paragraph, any given
word from our corpus (e.g., king) is assigned a location within the vector space. In, say, a
100-dimensional space, the location of the word king is specified by a vector that we can
call v_king that must consist of 100 numbers in order to specify the location of the word
king across all of the available dimensions.

Human brains aren’t adept at spatial reasoning in more than three dimensions, so our 
cartoon in Figure 2.6 has only three dimensions. In this three-dimensional space, any 
given word from our corpus needs three numeric coordinates to define its location within 



Figure 2.6 Diagram of word meaning as represented by a three-dimensional vector space


10. Mikolov, T., et al. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781.

11. Pennington, J., et al. (2014). GloVe: Global vectors for word representations. Proceedings of the Conference on 
Empirical Methods in Natural Language Processing. 

12. Or, alternatively, we could predict context words given a target word. More on that in Chapter 11.

the vector space: x, y, and z. In this cartoon example, then, the meaning of the word
king is represented by a vector v_king that consists of three numbers. If v_king is located
at the coordinates x = -0.9, y = 1.9, and z = 2.2 in the vector space, we can use the
annotation [-0.9, 1.9, 2.2] to describe this location succinctly. This succinct annotation
will come in handy shortly when we perform arithmetic operations on word
vectors.

The closer two words are within vector space, 13 the closer their meaning, as determined
by the similarity of the context words appearing near them in natural language.
Synonyms and common misspellings of a given word, because they share an identical
meaning, would be expected to have nearly identical context words and therefore nearly
identical locations in vector space. Words that are used in similar contexts, such as those
that denote time, tend to occur near each other in vector space. In Figure 2.6, Monday,
Tuesday, and Wednesday could be represented by the orange-colored dots located within
the orange days-of-the-week cluster in the cube’s top-right corner. Meanwhile, months
of the year might occur in their own purple cluster, which is adjacent to, but distinct
from, the days of the week; they both relate to the date, but they’re separate subclusters
within a broader dates region. As a second example, we would expect to find programming
languages clustering together in some location within the word-vector space that
is distant from the time-denoting words (say, in the top-left corner). Again here, object-oriented
programming languages like Java, C++, and Python would be expected to form
one subcluster, while nearby we would expect to find functional programming languages
like Haskell, Clojure, and Erlang forming a separate subcluster. As you’ll see
in Chapter 11 when you embed words in vector space yourself, less concretely defined
terms that nevertheless convey a specific meaning (e.g., the verbs created, developed,
and built) are also allocated positions within word-vector space that enable them to be
useful in NLP tasks.

Word-Vector Arithmetic 

Remarkably, because particular movements across vector space turn out to be an efficient
way for relevant word information to be stored in the vector space, these movements
come to represent relative particular meanings between words. This is a bewildering property. 14
Returning to our cube in Figure 2.6, the brown arrows represent the relationship
between countries and their capitals. That is, if we calculate the direction and distance
between the coordinates of the words Paris and France and then trace this direction and
distance from London, we should find ourselves in the neighborhood of the coordinate
representing the word England. As a second example, we can calculate the direction
and distance between the coordinates for man and woman. This movement through vector
space represents gender and is symbolized by the green arrows in Figure 2.6. If we
trace the green direction and distance from any given male-specific term (e.g., king, uncle),
we should find our way to a coordinate near the term’s female-specific counterpart
(queen, aunt).


13. Measured by Euclidean distance, which is the plain old straight-line distance between two points. 

14. One of your esteemed authors, Jon, prefers terms like “mind-bending” and “trippy” to describe this property
of word vectors, but he consulted a thesaurus to narrow in on a more professional-sounding adjective.

$$v_{\text{king}} - v_{\text{man}} + v_{\text{woman}} = v_{\text{queen}}$$

$$v_{\text{bezos}} - v_{\text{amazon}} + v_{\text{tesla}} = v_{\text{musk}}$$

$$v_{\text{windows}} - v_{\text{microsoft}} + v_{\text{google}} = v_{\text{android}}$$

Figure 2.7 Examples of word-vector arithmetic 


A by-product of being able to trace vectors of meaning (e.g., gender, capital-country
relationship) from one word in vector space to another is that we can perform word-vector
arithmetic. The canonical example of this is as follows: If we begin at v_king, the vector
representing king (continuing with our example from the preceding section, this location
is described by [-0.9, 1.9, 2.2]), subtract the vector representing man from it (let’s say
v_man = [-1.1, 2.4, 3.0]), and add the vector representing woman (let’s say v_woman =
[-3.2, 2.5, 2.6]), we should find a location near the vector representing queen. To
make this arithmetic explicit by working through it dimension by dimension, we would
estimate the location of v_queen by calculating


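$$
\begin{aligned}
x_{\text{queen}} &= x_{\text{king}} - x_{\text{man}} + x_{\text{woman}} = -0.9 + 1.1 - 3.2 = -3.0 \\
y_{\text{queen}} &= y_{\text{king}} - y_{\text{man}} + y_{\text{woman}} = 1.9 - 2.4 + 2.5 = 2.0 \\
z_{\text{queen}} &= z_{\text{king}} - z_{\text{man}} + z_{\text{woman}} = 2.2 - 3.0 + 2.6 = 1.8
\end{aligned}
\tag{2.1}
$$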


All three dimensions together, then, we expect v_queen to be near [-3.0, 2.0, 1.8].
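
The same calculation can be reproduced programmatically. The sketch below uses only the three toy vectors given in the text. The helper name estimate_analogy is our own invention, and rounding to one decimal place is there only to hide floating-point noise.

```python
# Toy three-dimensional vectors taken from the worked example in the text
v_king = [-0.9, 1.9, 2.2]
v_man = [-1.1, 2.4, 3.0]
v_woman = [-3.2, 2.5, 2.6]


def estimate_analogy(a, b, c):
    """Return a - b + c, computed one dimension at a time."""
    return [a_i - b_i + c_i for a_i, b_i, c_i in zip(a, b, c)]


v_queen_estimate = estimate_analogy(v_king, v_man, v_woman)
rounded = [round(value, 1) for value in v_queen_estimate]
print(rounded)  # [-3.0, 2.0, 1.8]
```

In a real word-vector space the estimate rarely lands exactly on a word, so the usual next step is to look for the word whose vector lies nearest to it.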

Figure 2.7 provides further, entertaining examples of arithmetic through a word-vector 
space that was trained on a large natural language corpus crawled from the web. As you’ll 
later observe in practice in Chapter 11, the preservation of these quantitative relationships 
of meaning between words across vector space is a robust starting point for deep learning 
models within NLP applications. 


word2viz 

To develop your intuitive appreciation of word vectors, navigate to bit.ly/word2viz.
The default screen for the word2viz tool for exploring word vectors interactively is 
shown in Figure 2.8. Leaving the top-right dropdown box set to “Gender analogies,” 
try adding in pairs of new words under the “Modify words” heading. If you add pairs 
of corresponding gender-specific words like princess and prince, duchess and duke,
and businesswoman and businessman, you should find that they fall into instructive
locations. 

The developer of the word2viz tool, Julia Bazinska, compressed a 50-dimensional 
word-vector space down to two dimensions in order to visualize the vectors on an 
xy-coordinate system. 15 For the default configuration, Bazinska scaled the x-axis from 
the words she to he as a reference point for gender, while the y-axis was set to vary 


15. We detail how to reduce the dimensionality of a vector space for visualization purposes in Chapter 11.

Figure 2.8 The default screen for word2viz, a tool for exploring word
vectors interactively


from a commonfolk base toward a royal peak by orienting it to the words woman and 
queen. The displayed words, placed into vector space via training on a natural language 
dataset consisting of 6 billion instances of 400,000 unique words, 16 fall relative to the two 
axes based on their meaning. The more regal (queen-like) the words, the higher on the 
plot they should be shown, and the female (she-like) terms fall to the left of their male
(he-like) counterparts. 
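
One simple way to place words on axes defined by word differences is a scalar projection onto each difference vector. The sketch below is an assumption about how such a plot could be built, not a description of word2viz’s actual code. The three-dimensional vectors are invented for illustration and are not GloVe values, and the helpers scalar_projection and difference are hypothetical names.

```python
import math


def scalar_projection(vector, axis):
    """Length of the component of vector that lies along axis."""
    dot = sum(v * a for v, a in zip(vector, axis))
    norm = math.sqrt(sum(a * a for a in axis))
    return dot / norm


def difference(end, start):
    """Vector pointing from start to end."""
    return [e - s for e, s in zip(end, start)]


# Made-up 3-D vectors for illustration only; these are not GloVe values
toy_vectors = {
    "she": [-1.0, 0.0, 0.2],
    "he": [1.0, 0.0, 0.2],
    "woman": [-0.9, 0.1, 0.5],
    "man": [0.9, 0.1, 0.5],
    "queen": [-0.8, 1.0, 0.4],
    "king": [0.8, 1.0, 0.4],
}

x_axis = difference(toy_vectors["he"], toy_vectors["she"])       # gender axis
y_axis = difference(toy_vectors["queen"], toy_vectors["woman"])  # royalty axis

coordinates = {
    word: (scalar_projection(vec, x_axis), scalar_projection(vec, y_axis))
    for word, vec in toy_vectors.items()
}

for word, (x, y) in coordinates.items():
    print(f"{word:>6}: x = {x:+.2f}, y = {y:+.2f}")
```

With these invented values, queen and king sit higher on the y-axis than woman and man, and each female term falls to the left of its male counterpart, which is the pattern the default word2viz view is described as showing.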

When you’ve indulged yourself sufficiently with word2viz’s “Gender analogies” 
view, you can experiment with other perspectives of the word-vector space. Selecting 
“Adjectives analogies” from the “What do you want to see?” dropdown box, you could, 
for example, add the words small and smallest. Subsequently, you could change the
x-axis labels to nice and nicer, and then again to small and big. Switching to the
“Numbers say-write analogies” view via the dropdown box, you could play around with
changing the x-axis to 3 and 7.

You may build your own word2viz plot from scratch by moving to the “Empty” view. 
The (word vector) world is your oyster, but you could perhaps examine the country- 
capital relationships we mentioned earlier when familiarizing you with Figure 2.6. To 
do this, set the x-axis to range from west to east and the y-axis to city and country.
Word pairs that fall neatly into this plot include london-england, paris-france,
berlin-germany, and beijing-china.


While on the one hand word2viz is an enjoyable way to develop a general understand- 
ing of word vectors, on the other hand it can also be a serious tool for gaining insight 
into specific strengths or weaknesses of a given word-vector space. As an example, use 


16. Technically, 400,000 tokens, a distinction that we examine later.

the “What do you want to see?” dropdown box to load the “Verb tenses” view, and
then add the words lead and led. Doing this, it becomes apparent that the coordinates
that words were assigned to in this vector space mirror existing gender stereotypes that 
were present in the natural language data the vector space was trained on. Switching to 
the “Jobs” view, this gender bias becomes even more stark. It is probably safe to say that 
any large natural language dataset is going to have some biases, whether intentional or 
not. The development of techniques for reducing biases in word vectors is an active 
area of research. 17 Mindful that these biases may be present in your data, however, 
the safest bet is to test your downstream NLP application in a range of situations that 
reflect a diverse userbase, checking that the results are appropriate. 


Localist Versus Distributed Representations 

With an intuitive understanding of word vectors under our figurative belts, we can con- 
trast them with one-hot representations (Figure 2.4), which have been an established 
presence in the NLP world for longer. A summary distinction is that we can say word 
vectors store the meaning of words in a distributed representation across n-dimensional 
space. That is, with word vectors, word meaning is distributed gradually— smeared —as we 
move from location to location through vector space. One-hot representations, mean- 
while, are localist. They store information on a given word discretely, within a single row 
of a typically extremely sparse matrix. 

To more thoroughly characterize the distinction between the localist, one-hot approach
and the distributed, vector-based approach to word representation, Table 2.1 compares
them across a range of attributes. First, one-hot representations lack nuance; they
are simple binary flags. Vector-based representations, on the other hand, are extremely 
nuanced: Within them, information about words is smeared throughout a continuous, 
quantitative space. In this high-dimensional space, there are essentially infinite possibilities 
for capturing the relationships between words. 

Second, the use of one-hot representations in practice often requires labor-intensive, 
manually curated taxonomies. These taxonomies include dictionaries and other spe- 
cialized reference language databases. 18 Such external references are unnecessary for 
vector-based representations, which are fully automatic with natural language data alone. 

Third, one-hot representations don’t handle new words well. A newly introduced 
word requires a new row in the matrix and then reanalysis relative to the existing rows 
of the corpus, followed by code changes—perhaps via reference to external information 
sources. With vector-based representations, new words can be incorporated by training 
the vector space on natural language that includes examples of the new words in their 


17. For example, Bolukbasi, T., et al. (2016). Man is to computer programmer as woman is to homemaker?
Debiasing word embeddings. arXiv:1607.06520; Caliskan, A., et al. (2017). Semantics derived automatically from
language corpora contain human-like biases. Science 356: 183-6; Zhang, B., et al. (2018). Mitigating unwanted
biases with adversarial learning. arXiv:1801.07593.

18. For example, WordNet (wordnet.princeton.edu), which describes synonyms as well as hypernyms (“is-a”
relationships, so furniture, for example, is a hypernym of chair).

Table 2.1 Contrasting attributes of localist, one-hot representations of words with
distributed, vector-based representations

One-Hot                            Vector-Based
Not subtle                         Very nuanced
Manual taxonomies                  Automatic
Handles new words poorly           Seamlessly incorporates new words
Subjective                         Driven by natural language data
Word similarity not represented    Word similarity = proximity in space


natural context. A new word gets its own new n-dimensional vector. Initially, there
may be few training data points involving the new word, so its vector might not be very
accurately positioned within n-dimensional space, but the positioning of all existing
words remains intact and the model will not fail to function. Over time, as the instances
of the new word in natural language increase, the accuracy of its vector-space coordinates
will improve. 19

Fourth, and following from the previous two points, the use of one-hot representations
often involves subjective interpretations of the meaning of language. This is because
they often require coded rules or reference databases that are designed by (relatively small
groups of) developers. The meaning of language in vector-based representations, meanwhile,
is data driven. 20

Fifth, one-hot representations natively ignore word similarity: Similar words, such as 
couch and sofa, are represented no differently than unrelated words, such as couch and 
cat. In contrast, vector-based representations innately handle word similarity: As men- 
tioned earlier with respect to Figure 2.6, the more similar two words are, the closer they 
are in vector space. 
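
The fifth point can be checked numerically. In the sketch below, the one-hot vectors are built directly from a three-word vocabulary, while the dense vectors are invented coordinates chosen only to illustrate the idea and do not come from any trained model. Distance is the straight-line Euclidean distance mentioned earlier.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


vocabulary = ["couch", "sofa", "cat"]

# Localist: each word gets a vector with a single 1 at its own index
one_hot = {
    word: [1.0 if i == index else 0.0 for i in range(len(vocabulary))]
    for index, word in enumerate(vocabulary)
}

# Distributed: made-up dense coordinates, for illustration only
dense = {
    "couch": [0.9, 0.1, 0.3],
    "sofa": [0.8, 0.2, 0.3],
    "cat": [-0.7, 0.9, 0.1],
}

distances = {}
for name, vectors in [("one-hot", one_hot), ("dense", dense)]:
    similar = euclidean_distance(vectors["couch"], vectors["sofa"])
    unrelated = euclidean_distance(vectors["couch"], vectors["cat"])
    distances[name] = (similar, unrelated)
    print(f"{name}: couch-sofa = {similar:.2f}, couch-cat = {unrelated:.2f}")
```

Every pair of distinct one-hot vectors is exactly the same distance apart (the square root of 2), so couch is no closer to sofa than to cat. In the dense toy space, couch and sofa sit close together while cat is far away.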

Elements of Natural Human Language 

Thus far, we have considered only one element of natural human language: the word. 
Words, however, are made up of constituent language elements. In turn, words them- 
selves are the constituents of more abstract, more complex language elements. We begin 
with the language elements that make up words and build up from there, following 
the schematic in Figure 2.9. With each element, we discuss how it is typically encoded 
from the traditional machine learning perspective as well as from the deep learning per- 
spective. As we move through these elements, notice that the distributed deep learning 


19. An associated problem not addressed here occurs when an in-production NLP algorithm encounters a word
that was not included within its corpus of training data. This out-of-vocabulary problem impacts both one-hot
representations and word vectors. There are approaches, such as Facebook’s fastText library, that try to get around
the issue by considering subword information, but these approaches are beyond the scope of this book.

20. Noting that they may nevertheless include biases found in natural language data. See the sidebar beginning on 
page 31.

[Diagram: phonemes and morphemes → words → syntax → semantics, arranged left to right by increasing abstractness and complexity]

Figure 2.9 Relationships between the elements of natural human language. The
leftmost elements are building blocks for further-right elements. As we move to the right,
the more abstract the elements become, and therefore the more complex they are to
model within an NLP application.

representations are fluid and flexible vectors whereas the traditional ML representations 
are local and rigid (Table 2.2). 

Phonology is concerned with the way that language sounds when it is spoken. Every 
language has a specific set of phonemes (sounds) that make up its words. The traditional 
ML approach is to encode segments of auditory input as specific phonemes from the 
language’s range of available phonemes. With deep learning, we train a model to predict 
phonemes from features automatically learned from auditory input and then represent 
those phonemes in a vector space. In this book, we work with natural language in text 
format only, but the techniques we cover can be applied directly to speech data if you’re
keen to do so on your own time. 

Morphology is concerned with the forms of words. As with phonemes, every language has
a specific set of morphemes, which are the smallest units of language that contain some
meaning. For example, the three morphemes out, go, and ing combine to form the word
outgoing. The traditional ML approach is to identify morphemes in text from a list of all
the morphemes in a given language. With deep learning, we train a model to predict the
occurrence of particular morphemes. Hierarchically deeper layers of artificial neurons can
then combine multiple vectors (e.g., the three representing out, go, and ing) into a single
vector representing a word.


Table 2.2 Traditional machine learning and deep learning representations, by
natural language element

Representation    Traditional ML      Deep Learning    Audio-Only
Phonology         All phonemes        Vectors          True
Morphology        All morphemes       Vectors          False
Words             One-hot encoding    Vectors          False
Syntax            Phrase rules        Vectors          False
Semantics         Lambda calculus     Vectors          False

Phonemes (when considering audio) and morphemes (when considering text) combine
to form words. Whenever we work with natural language data in this book, we
work at the word level. We do this for four reasons. First, it’s straightforward to define
what a word is, and everyone is familiar with what they are. Second, it’s easy to break
up natural language into words via a process called tokenization 21 that we work through
in Chapter 11. Third, words are the most-studied level of natural language, particularly
with respect to deep learning, so we can readily apply cutting-edge techniques to them.
Fourth, and perhaps most critically, for the NLP models we’ll be building, word vectors
simply work well: They prove to be functional, efficient, and accurate. In the preceding
section, we detail the shortcomings of localist, one-hot representations that predominate
traditional ML relative to the word vectors used in deep learning models.
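
As a rough illustration of the tokenization step mentioned above, the sketch below splits text into lowercase word tokens with a regular expression. It is a deliberately naive stand-in, not the approach worked through in Chapter 11: the function name tokenize and the choice to keep apostrophes inside tokens are our own assumptions.

```python
import re


def tokenize(text):
    """Lowercase the text and return its runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("You shall know a word by the company it keeps.")
print(tokens)
```

Punctuation and whitespace act purely as separators here, which matches the informal description of tokenization as using such characters to assume where one word ends and the next begins.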

Words are combined to generate syntax. Syntax and morphology together constitute 
the entirety of a language's grammar. Syntax is the arrangement of words into phrases 
and phrases into sentences in order to convey meaning in a way that is consistent across 
the users of a given language. In the traditional ML approach, phrases are bucketed into 
discrete, formal linguistic categories. 22 With deep learning, we employ vectors (surprise, 
surprise!). Every word and every phrase in a section of text can be represented by a vector 
in n-dimensional space, with layers of artificial neurons combining words into phrases.

Semantics is the most abstract of the elements of natural language in Figure 2.9 and 
Table 2.2; it is concerned with the meaning of sentences. This meaning is inferred from all 
the underlying language elements like words and phrases, as well as the overarching con- 
text that a piece of text appears in. Inferring meaning is complex because, for example, 
whether a passage is supposed to be taken literally or as a humorous and sarcastic remark 
can depend on subtle contextual differences and shifting cultural norms. Traditional ML, 
because it doesn’t represent the fuzziness of language (e.g., the similarity of related words
or phrases), is limited in capturing semantic meaning. With deep learning, vectors come 
to the rescue once again. Vectors can represent not only every word and every phrase in 
a passage of text but also every logical expression. As with the language elements already 
covered, layers of artificial neurons can recombine vectors of constituent elements—in 
this case, to calculate semantic vectors via the nonlinear combination of phrase vectors. 


Google Duplex 

One of the more attention-grabbing examples of deep-learning-based NLP in recent
years is that of the Google Duplex technology, which was unveiled at the company’s I/O
developers conference in May 2018. The search giant’s CEO, Sundar Pichai, held spectators
in rapture as he demonstrated Google Assistant making a phone call to a Chinese-food
restaurant to book a reservation. The audible gasps from the audience were in
response to the natural flow of Duplex’s conversation. It had mastered the cadence of a
human conversation, replete with the uh’s and hmm’s that we sprinkle into conversations


21. Essentially, tokenization is the use of characters like commas, periods, and whitespace to assume where one 
word ends and the next begins. 

22. These categories have names like “noun-phrase” and “verb-phrase.”

while we’re thinking. Furthermore, the phone call was of average audio quality and the
human on the line had a strong accent; Duplex never faltered, and it managed to make
the booking.

Bearing in mind that this is a demonstration (and not even a live one), what nevertheless
impressed us was the breadth of deep learning applications that had to come
together to facilitate this technology. Consider the flow of information back and forth
between the two agents on the call (Duplex and the restaurateur): Duplex needs a sophisticated
speech recognition algorithm that can process audio in real time and handle
an extensive range of accents and call qualities on the other end of the line, and also
overcome the background noise. 23

Once the human’s speech has been faithfully transcribed, an NLP model needs to
process the sentence and decide what it means. The intention is that the person on the
line doesn’t know they’re speaking to a computer and so doesn’t need to modulate their
speech accordingly, but in turn, this means that humans respond with complex, multipart
sentences that can be tricky for a computer to tease apart:

“We don’t have anything tomorrow, but we have the next day and Thursday, anytime 
before eight. Wait no . . . Thursday at seven is out. But we can do it after eight?” 

This sentence is poorly structured—you’d never write an email like this—but in natural 
conversation, these sorts of on-the-fly corrections and replacements happen regularly, and 
Duplex needs to be able to follow along. 

With the audio transcribed and the meaning of the sentence processed, Duplex’s NLP
model conjures up a response. This response must ask for more information if the human
was unclear or if the answers were unsatisfactory; otherwise, it should confirm the booking.
The NLP model will generate a response in text form, so a text-to-speech (TTS)
engine is required to synthesize the sound.


Duplex uses a combination of de novo waveform synthesis using Tacotron 24 and 
WaveNet, 25 as well as a more classical “concatenative” text-to-speech engine. 26 This 
is where the system crosses the so-called uncanny valley: 27 The voice heard by the 


23. This is known as the “cocktail-party problem” or, less jovially, “multitalker speech separation.” It’s a problem
that humans solve innately, isolating single voices from a cacophony quite well without explicit instruction on how
to do so. Machines typically struggle with this, although a variety of groups have proposed solutions. For example,
see Simpson, A., et al. (2015). Deep karaoke: Extracting vocals from musical mixtures using a convolutional
deep neural network. arXiv:1504.04658; Yu, D., et al. (2016). Permutation invariant training of deep models for
speaker-independent multi-talker speech separation. arXiv:1607.00325.

24. bit.ly/tacotron

25. bit.ly/waveNet

26. Concatenative TTS engines use vast databases of prerecorded words and snippets, which can be strung together 
to form sentences. This approach is common and fairly easy, but it yields stilted, unnatural speech and cannot adapt 
the speed and intonation; you can’t modulate a word to make it sound as if a question is being asked, for example. 

27. The uncanny valley is a perilous space wherein humans find humanlike simulations weird and creepy because
they’re too similar to real humans but are clearly not real humans. Product designers endeavor to avoid the uncanny
valley. They’ve learned that users respond well to simulations that are either very robotic or not robotic at all.

restaurateur is not a human voice at all. WaveNet is able to generate completely
synthetic waveforms, one sample at a time, using a deep neural network trained
on real waveforms from human speakers. Beneath this, Tacotron maps sequences of
words to corresponding sequences of audio features, which capture subtleties of human
speech such as pitch, speed, intonation, and even pronunciation. These features are
then fed into WaveNet, which synthesizes the actual waveform that the restaurateur
hears. This whole system is able to produce a natural-sounding voice with the correct
cadence, emotion, and emphasis. During more-or-less rote moments in the conversation,
the simple concatenative TTS engine (composed of recordings of its own “voice”),
which is less computationally demanding to execute, is used. The overall system
dynamically switches between the various models as needed.


To misquote Jerry Maguire, you had all of this at “hello.” The speech recognition 
system, NLP models, and TTS engine all work in concert from the instant the call is an- 
swered. Things only stand to get more complex for Duplex from then on. Governing all 
of this interaction is a deep neural network that is specialized in handling information that 
occurs in a sequence. 28 This governor tracks the conversation and feeds the various inputs 
and outputs into the appropriate models. 

It should be clear from this overview that Google Duplex is a sophisticated system
of deep learning models that work in harmony to produce a seamless interaction on the 
phone. For now, Duplex is nevertheless limited to a few specific domains: scheduling 
appointments and reservations. The system cannot carry out general conversations. So 
even though Duplex represents a significant step forward for artificial intelligence, there 
is still much work to be done.

Summary 

In this chapter, you learned about applications of deep learning to the processing of nat- 
ural language. To that end, we described further the capacity for deep learning models 
to automatically extract the most pertinent features from data, removing the need for 
labor-intensive one-hot representations of language. Instead, NLP applications involving 
deep learning make use of vector-space embeddings, which capture the meaning of words 
in a nuanced manner that improves both model performance and accuracy. 

In Chapter 11, you’ll construct an NLP application by making use of artificial neural
networks that handle the input of natural language data all the way through to the output 
of an inference about those data. In such “end-to-end” deep learning models, the initial 
layers create word vectors that flow seamlessly into deeper, specialized layers of artificial 
neurons, including layers that incorporate “memory.” These model architectures highlight 
both the strength and the ease of use of deep learning with word vectors. 


28. Called a recurrent neural network. These feature in Chapter 11.

Machine Art


In this chapter, we introduce some of the concepts that enable deep learning models
to seemingly create art, an idea that may be paradoxical to some. The University of
California, Berkeley, philosopher Alva Noë, for one, opined, “Art can help us frame a
better picture of our human nature.” 1 If this is true, how can machines create art? Or
put differently, are the creations that emerge from these machines, in fact, art? Another
interpretation (and one we like best) is that these creations are indeed art and that
programmers are artists wielding deep learning models as brushes. We’re not the only
ones who view these works as bona fide artistry: generative adversarial network (GAN)-produced
paintings have been snapped up to the tune of $400,000 a pop. 2

Over the course of this chapter, we cover the high-level concepts behind GANs, and 
you will see examples of the novel visual works they can produce. We will draw a link 
between the latent spaces associated with GANs and the word-vector spaces of Chapter 2. 
And we will cover a deep learning model that can be used as an automated tool for 
dramatically improving the quality of photos. But before we do any of that, let’s grab a 
drink . . . 

A Boozy All-Nighter 

Below Google’s offices in Montreal sits a bar called Les 3 Brasseurs, a moniker that translates 
from French to “The 3 Brewers.” It was at this watering hole in 2014, while a PhD 
student in Yoshua Bengio’s renowned lab (Figure 1.10), that Ian Goodfellow conceived 
of an algorithm for fabricating realistic-looking images, 3 a technique that Yann LeCun 
(Figure 1.9) has hailed as the “most important” recent breakthrough in deep learning. 4 

Goodfellow’s friends described to him a generative model they were working on, that is, 
a computational model that aims to produce something novel, be it a quote in the style 
of Shakespeare, a musical melody, or a work of abstract art. In their particular case, the 


1. Noë, A. (2015, October 5). What art unveils. The New York Times. 

2. Cohn, G. (2018, October 25). AI art at Christie’s sells for $432,500. The New York Times. 

3. Giles, M. (2018, February 21). The GANfather: The man who’s given machines the gift of imagination. MIT 
Technology Review. 

4. LeCun, Y. (2016, July 28). Quora. bit.ly/DLbreakthru 
friends were attempting to design a model that could generate photorealistic images such 
as portraits of human faces. For this to work well via the traditional machine learning 
approach (Figure 1.12), the engineers designing the model would need to not only cata- 
log and approximate the critical individual features of faces like eyes, noses, and mouths, 
but also accurately estimate how these features should be arranged relative to each other. 
Thus far, their results had been underwhelming. The generated faces tended to be 
excessively blurry, or they tended to be missing essential elements like the nose or 
the ears. 

Perhaps with his creativity heightened by a pint of beer or two, 5 Goodfellow pro- 
posed a revolutionary idea: a deep learning model in which two artificial neural networks 
(ANNs) act against each other competitively as adversaries. As illustrated in Figure 3.1, 
one of these deep ANNs would be programmed to produce forgeries while the other 
would be programmed to act as a detective and distinguish the fakes from real images 
(which would be provided separately). These adversarial deep learning networks would 
play off one another: As the generator became better at producing fakes, the discriminator 
would need to become better at identifying them, and so the generator would need to 
produce even more compelling counterfeits, and so on. This virtuous cycle would even- 
tually lead to convincing novel images in the style of the real training images, be they of 
faces or otherwise. Best of all, Goodfellow’s approach would circumvent the need to 
program features into the generative model manually. As we expounded with respect to 






Figure 3.1 High-level schematic of a generative adversarial network (GAN). Real 
images, as well as forgeries produced by the generator, are provided to the discriminator, 
which is tasked with identifying which are the genuine articles. The orange cloud 
represents latent space (Figure 3.4) “guidance” that is provided to the forger. This 
guidance can either be random (as is generally the case during network training; see 
Chapter 12) or selective (during post-training exploration, as in Figure 3.3). 


5. Jarosz, A., et al. (2012). Uncorking the muse: Alcohol intoxication facilitates creative problem solving. 
Consciousness and Cognition, 21, 487-93. 
Figure 3.2 Results presented in Goodfellow and colleagues’ 2014 GAN paper 


machine vision (Chapter 1) and natural language processing (Chapter 2), deep learning 
would sort out the model’s features automatically. 
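
The adversarial back-and-forth between forger and detective can be sketched in a few lines of plain Python. This is a toy under loud assumptions, not Goodfellow's implementation and far simpler than the GAN built in Chapter 12: the "real images" are just numbers drawn from a normal distribution centered on 4, the generator is a hypothetical linear function of a random latent input, and the discriminator is a single logistic unit. All names (`train_toy_gan`, `a`, `b`, `w`, `c`) are made up for illustration.

```python
import math
import random

def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def train_toy_gan(steps=500, lr=0.01, seed=0):
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator: fake = a * z + b
    w, c = 0.1, 0.0   # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = rng.gauss(4.0, 1.0)   # a "real" sample
        z = rng.gauss(0.0, 1.0)      # random latent input
        fake = a * z + b             # the generator's forgery

        # Discriminator step: push D(real) toward 1 and D(fake) toward 0
        # (gradient of -log D(real) - log(1 - D(fake)) with respect to w and c).
        d_real = sigmoid(w * real + c)
        d_fake = sigmoid(w * fake + c)
        w -= lr * (-(1.0 - d_real) * real + d_fake * fake)
        c -= lr * (-(1.0 - d_real) + d_fake)

        # Generator step: push D(fake) toward 1, that is, fool the discriminator
        # (gradient of -log D(fake) with respect to a and b).
        d_fake = sigmoid(w * fake + c)
        grad_fake = -(1.0 - d_fake) * w
        a -= lr * grad_fake * z
        b -= lr * grad_fake
    return (a, b), (w, c)

(a, b), (w, c) = train_toy_gan()
print("generator:", a, b)
print("discriminator:", w, c)
```

Neither network is told what the real data look like in terms of hand-built features; the generator's only training signal is the discriminator's verdict on its forgeries.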

Goodfellow’s friends were doubtful his imaginative approach would work. So, when 
he arrived home and found his girlfriend asleep, he worked late to architect his dual- 
ANN design. It worked the first time, and the astounding deep learning family of genera- 
tive adversarial networks was born! 

That same year, Goodfellow and his colleagues revealed GANs to the world at the 
prestigious Neural Information Processing Systems (NIPS) conference. 6 Some of their 
results are shown in Figure 3.2. Their GAN produced these novel images by being trained 
on (a) handwritten digits; 7 (b) photos of human faces; 8 and (c) and (d) photos from across 
ten diverse classes (e.g., planes, cars, dogs). 9 The results in (c) are markedly less crisp than 
in (d), because the GAN that produced the latter featured neuron layers specialized for 
machine vision called convolutional layers, 10 whereas the GAN that produced the former 
used a more general layer type only. 11 

Arithmetic on Fake Human Faces 


Following on from Goodfellow’s lead, a research team led by the American machine 
learning engineer Alec Radford determined architectural constraints for GANs that guide 
considerably more realistic image creation. Some examples of portraits of fake humans 


6. Goodfellow, I., et al. (2014). Generative adversarial networks. arXiv:1406.2661. 

7. From LeCun’s classic MNIST dataset, which we use ourselves in Part II. 

8. From the Hinton (Figure 1.16) research group’s Toronto Face database. 

9. The CIFAR-10 dataset, which is named after the Canadian Institute for Advanced Research that supported its 
creation. 

10. We detail these in Chapter 10. 

11. Dense layers, which are introduced in Chapter 4 and detailed in Chapter 7. 
Figure 3.3 An example of latent-space arithmetic from Radford et al. (2016) 


that were produced by their deep convolutional GANs 12 are provided in Figure 3.3. In their 
paper, Radford and his teammates cleverly demonstrated interpolation through, and arithmetic 
with, the latent space associated with GANs. Let’s start off by explaining what latent 
space is before moving on to latent-space interpolation and arithmetic. 

The latent-space cartoon in Figure 3.4 may be reminiscent of the word-vector space 
cartoon in Figure 2.6. As it happens, there are three major similarities between latent 
spaces and vector spaces. First, while the cartoon is only three-dimensional for simplicity 
and comprehensibility, latent spaces are n-dimensional spaces, usually on the order of 
hundreds of dimensions. The latent space of the GAN you’ll later architect yourself in 
Chapter 12, for example, will have n = 100 dimensions. Second, the closer two points 
are in the latent space, the more similar the images that those points represent. And third, 
movement through the latent space in any particular direction can correspond to a gradual 
change in a concept being represented, such as age or gender for the case of photorealistic 
faces. 

By picking two points far away from each other along some n-dimensional axis representing 
age, interpolating between them, and sampling points from the interpolated 
line, we could find what appears to be the same (fabricated) man gradually appearing 
to be older and older. 13 In our latent-space cartoon (Figure 3.4), we represent such an 
“age” axis in purple. To observe interpolation through an authentic GAN latent space, we 
recommend scanning through Radford and colleagues’ paper for, as an example, smooth 
rotations of the “photo angle” of synthetic bedrooms. At the time of writing, the state 


12. Radford, A., et al. (2016). Unsupervised representation learning with deep convolutional generative adversarial 
networks. arXiv: 1511.06434v2. 

13. A technical aside: As is the case with vector spaces, this “age” axis (or any other direction within latent space 
that represents some meaningful attribute) may be orthogonal to all of the n dimensions that constitute the axes 
of the n-dimensional space. We discuss this further in Chapter 11. 


Figure 3.4 A cartoon of the latent space associated with generative adversarial 
networks (GANs). Moving along the purple arrow, the latent space corresponds to images 
of a similar-looking individual aging. The green arrow represents gender, and the orange 
one represents the inclusion of glasses on the face. 


of the art in GANs can be viewed at bit.ly/InterpCeleb. This video, produced by researchers 
at the graphics-card manufacturer Nvidia, provides a breathtaking interpolation 
through high-quality portrait “photographs” of ersatz celebrities. 14,15 
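
The interpolation described above amounts to sampling evenly spaced points on the straight line between two latent vectors. The sketch below assumes a hypothetical three-dimensional latent space with made-up coordinates (a real GAN would use on the order of a hundred dimensions); in practice each sampled point would be passed to a trained generator, which is not included here.

```python
def interpolate(start, end, num_points):
    """Return num_points evenly spaced vectors on the line from start to end."""
    if num_points < 2:
        raise ValueError("num_points must be at least 2")
    points = []
    for i in range(num_points):
        t = i / (num_points - 1)   # 0.0 at start, 1.0 at end
        points.append([(1 - t) * s + t * e for s, e in zip(start, end)])
    return points

# Hypothetical latent coordinates for a "young" and an "old" rendering of one face.
young = [0.0, 0.5, -1.0]
old = [4.0, 0.5, 1.0]

for point in interpolate(young, old, 5):
    print(point)   # each point would be fed to the generator to render an image
```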

Moving a step further with what you’ve learned, you could now perform arithmetic 
with images sampled from a GAN’s latent space. When sampling a point within the 
latent space, that point can be represented by the coordinates of its location; the resulting 
vector is analogous to the word vectors described in Chapter 2. As with word vectors, 
you can perform arithmetic with these vectors and move through the latent space in a 
semantic way. Figure 3.3 showcases an instance of latent-space arithmetic from Radford 
and his coworkers. Starting with a point in their GAN’s latent space that represents a man 
with glasses, subtracting a point that represents a man without glasses, and adding a point 


14. Karras, T., et al. (2018). Progressive growing of GANs for improved quality, stability, and variation. Proceedings 
of the International Conference on Learning Representations. 

15. To try your hand at distinguishing between real and GAN-generated faces, visit whichfaceisreal.com. 
representing a woman without glasses, the resulting point exists in the latent space near to 
images that represent women with glasses. Our cartoon in Figure 3.4 illustrates how the 
relationships between meaning in latent space are stored (again, akin to the way they are 
in word-vector space), thereby facilitating arithmetic on points in latent space. 
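
The glasses example can be written out as vector arithmetic. In this sketch the latent space is a hypothetical three-dimensional one whose axes happen to line up with age, gender, and glasses, and every coordinate is made up; as footnote 13 cautions, meaningful directions in a real latent space need not align with its axes.

```python
def add(u, v):
    return [a + b for a, b in zip(u, v)]

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

# Hypothetical axes: [age, gender (0 = man, 1 = woman), glasses (0 = no, 1 = yes)]
man_with_glasses = [0.5, 0.0, 1.0]
man_without_glasses = [0.5, 0.0, 0.0]
woman_without_glasses = [0.5, 1.0, 0.0]

# (man with glasses) - (man without glasses) isolates the "glasses" direction,
# which is then added to the point representing a woman without glasses.
result = add(subtract(man_with_glasses, man_without_glasses), woman_without_glasses)
print(result)   # [0.5, 1.0, 1.0], the "woman with glasses" corner of this toy space
```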

Style Transfer: Converting Photos into Monet 
(and Vice Versa) 

One of the more magical applications of GANs is style transfer. Zhu, Park, and their 
coworkers from the Berkeley Artificial Intelligence Research (BAIR) Lab introduced 
a new flavor of GAN 16 that enables stunning examples of this, as shown in Figure 3.5. 
Alexei Efros, one of the paper’s coauthors, took photos while on holiday in France, 
and the researchers employed their CycleGAN to transfer these photos into the style 
of the Impressionist painter Claude Monet, the nineteenth-century Dutch artist 
Vincent Van Gogh, and the Japanese Ukiyo-e genre, among others. If you navigate to 
bit.ly/cycleGAN, you’ll be delighted to discover instances of the inverse (Monet paintings 
converted into photorealistic images), as well as: 

■ Summer scenes converted into wintry ones, and vice versa 

■ Baskets of apples converted into baskets of oranges, and vice versa 



Figure 3.5 Photos converted into the styles of well-known painters by CycleGANs 


16. Called “CycleGANs” because they retain image consistency over multiple cycles of network training. 
Zhu, J.-Y., et al. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. 
arXiv:1703.10593. 
■ Flat, low-quality photos converted into what appear to be ones captured by high-end 
(single-lens reflex) cameras 

■ A video of a horse running in a field converted into a zebra 

■ A video of a drive taken during the day converted into a nighttime one 

Make Your Own Sketches Photorealistic 


Another GAN application out of Alexei Efros’s BAIR lab, and one that you can amuse 
yourself with straightaway, is pix2pix. 17 If you make your way to bit.ly/pix2pixDemo, 
you can interactively translate images from one type to another. Using the edges2cats 
tool, for example, we sketched the three-eyed cat in the left-hand panel of Figure 3.6 to 
generate the photorealistic(-ish) mutant kitty in the right-hand panel. As it takes your 
fancy, you are also welcome to convert your own creative visions of felines, shoes, handbags, 
and building facades into photorealistic analogs within your browser. The authors 
of the pix2pix paper call their approach a conditional GAN (cGAN for short) because the 
generative adversarial network produces an output that is conditional on the particular 
input provided to it. 

Creating Photorealistic Images from Text 

To round out this chapter, we’d like you to take a gander at the truly photorealistic 
high-resolution images in Figure 3.7. These images were generated by StackGAN, 18 an 



Figure 3.6 A mutant three-eyed cat (right-hand panel) synthesized via the pix2pix web 
application. The sketch in the left-hand panel that the GAN output was conditioned on 
was clearly not doodled by this book’s illustrator, Aglaé, but by one of its other authors 
(who shall remain nameless). 


17. Isola, P., et al. (2017). Image-to-image translation with conditional adversarial networks. arXiv: 1611.07004. 

18. Zhang, H., et al. (2017). StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial 
networks. arXiv: 1612.03242v2. 
[Figure 3.7 panels: (a) Stage-I images; (b) Stage-II images. Text inputs: “This bird has a yellow belly and tarsus, grey back, wings, and brown throat, nape with a black face”; “This bird is white with some black on its head and wings, and has a long orange beak”; “This flower has overlapping pink pointed petals surrounding a ring of short yellow filaments”] 



Figure 3.7 Photorealistic high-resolution images output by StackGAN, which involves 
two GANs stacked upon each other 


approach that stacks two GANs on top of each other. The first GAN in the architecture is 
configured to produce a rough, low-resolution image with the general shape and colors of 
the relevant objects in place. This is then supplied to the second GAN as its input, where 
the forged “photo” is refined by fixing up imperfections and adding considerable detail. 
The StackGAN is a cGAN like the pix2pix network in the preceding section; however, 
the image output is conditioned on text input instead of an image. 

Image Processing Using Deep Learning 

Since the advent of digital camera technology, image processing (both on-device and 
postprocessing) has become a staple in most (if not all) photographers’ workflows. This 
ranges from simple on-device processing, such as increasing saturation and sharpness immediately 
after capture, to complex editing of raw image files in software applications like 
Adobe Photoshop and Lightroom. 

Machine learning has been used extensively in on-device processing, where the camera 
manufacturer would like the image that the consumer sees to be vibrant and pleasing to 
the eye with minimal user effort. Some examples of this are: 

■ Early face-recognition algorithms in point-and-shoot cameras, which optimize the 
exposure and focus for faces or even selectively fire the shutter when they recognize 
that the subject is smiling (as in Figure 1.13) 

■ Scene-detection algorithms that adjust the exposure settings to capture the white- 
ness of snow or activate the flash for nighttime photos 

In the postprocessing arena, a variety of automatic tools exist, although generally photographers 
who are taking the time to postprocess images are investing considerable time 
and domain-specific knowledge into color and exposure correction, denoising, sharpening, 
tone mapping, and touching up (to name just a few of the corrections that may be 
applied). 

Historically, these corrections have been difficult to execute programmatically, 
because, for example, denoising might need to be applied selectively to different images 
and even different parts of the same image. This is exactly the kind of intelligent 
application that deep learning is poised to excel at. 

In a 2018 paper from Chen Chen and his collaborators at Intel Labs, 19 deep learning 
was applied to the enhancement of images that were taken in near total darkness, with 
astonishing results (Figure 3.8). In a phrase, their deep learning model involves convolutional 
layers organized into the innovative U-Net 20 architecture (which we break down 
for you in Chapter 10). The authors generated a custom dataset for training this model: 
the See-in-the-Dark dataset consists of 5,094 raw short-exposure images 21 of very dark 
scenes, each paired with a corresponding long-exposure image (captured using a tripod for 
stability) of the same scene. The exposure times on the long-exposure images were 100 
to 300 times those of the short-exposure training images, with actual exposure times in 
the range of 10 to 30 seconds. As demonstrated in Figure 3.8, the deep-learning-based 
image-processing pipeline of U-Net (right panel) far outperforms the results of the traditional 
pipeline (center panel). There are, however, limitations as yet: 

■ The model is not fast enough to perform this correction in real time (and certainly 
not on-device); however, runtime optimization could help here. 

■ A dedicated network must be trained for different camera-models and sensors, 
whereas a more generalized and camera-model-agnostic approach would be favor- 
able. 

■ While the results far exceed the capabilities of traditional pipelines, there are still 
some artifacts present in the enhanced photos that could stand to be improved. 

■ The dataset is limited to selected static scenes and needs to be expanded to other 
subjects (most notably, humans). 



Figure 3.8 A sample image (left) processed using a traditional pipeline (center) and 
the deep learning pipeline by Chen et al. (right) 


19. Chen, C., et al. (2018) Learning to see in the dark. arXiv:1805.01934. 

20. Ronneberger et al. (2015) U-Net: Convolutional networks for biomedical image segmentation. arXiv: 
1505.04597. 

21. That is, a short enough exposure time to enable practical handheld capture without motion blur but that 
renders images too dark to be useful. 
Limitations aside, this work nevertheless provides a beguiling peek into how deep learning 
can adaptively correct images in photograph postprocessing pipelines with a level of 
sophistication not before seen from machines. 

Summary 

In this chapter, we introduced GANs and conveyed that these deep learning models 
encode exceptionally sophisticated representations within their latent spaces. These rich 
visual representations enable GANs to create novel images with particular, granular artistic 
styles. The outputs of GANs aren’t purely aesthetic; they can be practical, too. They can, 
as examples, simulate data for training autonomous vehicles, hurry the pace of prototyp- 
ing in the fields of fashion and architecture, and substantially augment the capacities of 
creative humans. 22 

In Chapter 12, after we get all of the prerequisite deep learning theory out of the way, 
you’ll construct a GAN yourself to imitate sketches from the Quick, Draw! dataset (introduced 
at the end of Chapter 1). Take a gander at Figure 3.9 for a preview of what you’ll 
be able to do. 




Figure 3.9 Novel “hand drawings” of apples produced by the GAN architecture 
we develop together in Chapter 12. Using this approach, you can produce 
machine-drawn “sketches” from across any of the hundreds of categories involved in the 
Quick, Draw! game. 


22. Carter, S., and Nielsen, M. (2017, December 4). Using artificial intelligence to augment human intelligence. 
Distill. distill.pub/2017/aia 
Game-Playing Machines 


Alongside the generative adversarial networks introduced in Chapter 3, deep reinforcement 
learning has produced some of the most surprising artificial-neural-network 
advances, including the lion’s share of the headline-grabbing “artificial intelligence” 
breakthroughs of recent years. In this chapter, we introduce what reinforcement learning 
is as well as how its fusion with deep learning has enabled machines to meet or surpass 
human-level performance on a diverse range of complex challenges, including Atari video 
games, the board game Go, and subtle physical-manipulation tasks. 

Deep Learning, AI, and Other Beasts 

Earlier in this book, we introduced deep learning with respect to vision (Chapter 1), 
language (Chapter 2), and the generation of novel “art” (Chapter 3). In doing this, we’ve 
loosely alluded to deep learning’s relationship to the concept of artificial intelligence. At 
this stage, as we begin to cover deep reinforcement learning, it is worthwhile to define 
these terms more thoroughly as well as the terms’ relationships to one another. As usual, 
we will be assisted by visual cues—in this case, the Venn diagram in Figure 4.1. 

Artificial Intelligence 

Artificial intelligence is the buzziest, vaguest, and broadest of the terms we cover in this 
section. Taking a stab at a technical definition regardless, a decent one is that AI involves 
a machine processing information from its surrounding environment and then factoring 
that information into decisions toward achieving some desired outcome. Perhaps given 
this, some consider the goal of AI to be the achievement of “general intelligence”— 
intelligence as it is generally referred to with respect to broad reasoning and problem- 
solving capabilities. 1 In practice and particularly in the popular press, “AI” is used to 
describe any cutting-edge machine capability. Presently, these capabilities include voice 
recognition, describing what’s happening in a video, question-answering, driving a car, 


1. Defining “intelligence” is not straightforward, and the great debate on it is beyond the scope of this book. 
A century-old definition of the term that we find amusing and that still today has some proponents among 
contemporary experts is that “intelligence is whatever IQ tests measure.” See, for example, van der Maas, H., et al. 
(2014). Intelligence is what the intelligence test measures. Seriously. Journal of Intelligence, 2, 12-15. 
Figure 4.1 Venn diagram showing the relative positioning of the major concepts 
covered over the course of this book 


industrial robots that mimic human exemplars in the factory, and dominating humans 
at “intuition-heavy” board games like Go. Once an AI capability becomes commonplace 
(e.g., recognizing handwritten digits, which was cutting-edge in the 1990s; see 
Chapter 1), the “AI” moniker is typically dropped by the popular press for that capability 
such that the goalposts on the definition of AI are always moving. 

Machine Learning 

Machine learning is a subset of AI alongside other facets of AI like robotics. Machine learning 
is a field of computer science concerned with setting up software in a manner so that 
the software can recognize patterns in data without the programmer needing to explicitly 
dictate how the software should carry out all aspects of this recognition. That said, the 
programmer would typically have some insight into or hypothesis about how the problem 
might be solved, and would thereby provide a rough model framework and relevant data 
such that the learning software is well prepared and well equipped to solve the problem. 
As depicted in Figure 1.12 and discussed time and again within the earlier chapters of this 
book, machine learning traditionally involves cleverly—albeit manually, and therefore 
laboriously—processing raw inputs to extract features that jibe well with data-modeling 
algorithms. 
Representation Learning 

Peeling back another layer of the Figure 4.1 onion, we find representation learning. 

This term was introduced at the start of Chapter 2, so we don’t go into much detail 
again here. To recap briefly, representation learning is a branch of machine learning 
in which models are constructed in a way that—provided they are fed enough data— 
they learn features (or representations) automatically. These learned features may wind up 
being both more nuanced and more comprehensive than their manually curated cousins. 
The trade-off is that the learned features might not be as well understood nor as straight- 
forward to explain, although academic and industrial researchers alike are increasingly 
tackling these hitches. 2 

Artificial Neural Networks 

Artificial neural networks (ANNs) dominate the field of representation learning today. As 
was touched on in earlier chapters and will be laid bare in Chapter 6, artificial neurons are 
simple algorithms inspired by biological brain cells, especially in the sense that individual 
neurons—whether biological or artificial—receive input from many other neurons, per- 
form some computation, and then produce a single output. An artificial neural network, 
then, is a collection of artificial neurons arranged so that they send and receive informa- 
tion between each other. Data (e.g., images of handwritten digits) are fed into an ANN, 
which processes these data in some way with the goal of producing some desired result 
(e.g., an accurate guess as to what digits are represented by the handwriting). 
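
As a rough illustration of "receive input from many other neurons, perform some computation, and then produce a single output," here is a minimal sketch of one artificial neuron. The particular computation (a weighted sum plus a bias, passed through a sigmoid) is one common choice that we assume here for illustration; Chapter 6 covers the details, and the input values and weights below are made up.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid into (0, 1)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Three incoming values (e.g., outputs of three other neurons) yield one output.
output = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.4, 0.3, 0.1], bias=0.1)
print(output)   # roughly 0.55
```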

Deep Learning 

Of all the terms in Figure 4.1, deep learning is the easiest to define because it’s so precise. 
We have mentioned a couple of times already in this book that a network composed of at 
least a few layers of artificial neurons can be called a deep learning network. As exempli- 
fied by the classic architectures in Figures 1.11 and 1.17; diagramed simply in Figure 4.2; 
and fleshed out fully in Chapter 7, deep learning networks have a total of five or more 
layers with the following structure: 

■ A single input layer that is reserved for the data being fed into the network. 

■ Three or more hidden layers that learn representations from the input data. A 
general-purpose and frequently used type of hidden layer is the dense type, in 
which all of the neurons in a given layer can receive information from each of the 
neurons in the previous layer (it is apt, then, that a common synonym for “dense 
layer” is fully connected layer). In addition to this versatile hidden-layer type, there 
is a cornucopia of specialized types for particular use cases; we touch on the most 
popular ones as we make our way through this section. 

■ A single output layer that is reserved for the values (e.g., predictions) that the network 
yields. 
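
The five-layer structure listed above (one input layer, three dense hidden layers, one output layer) can be sketched as a forward pass in plain Python. This is an untrained toy under stated assumptions: the layer sizes, the `tanh` activation, the random weights, and the helper name `random_layer` are all choices made for illustration, not the architecture of any particular model in this book.

```python
import math
import random

def dense_layer(inputs, weights, biases):
    """Fully connected layer: every output neuron sees every input value."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(x * w for x, w in zip(inputs, neuron_weights)) + bias
        outputs.append(math.tanh(z))
    return outputs

def random_layer(n_inputs, n_neurons, rng):
    """Hypothetical helper: small random weights and zero biases (untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
               for _ in range(n_neurons)]
    return weights, [0.0] * n_neurons

rng = random.Random(42)
layer_sizes = [4, 8, 8, 8, 2]   # input, three hidden layers, output

# One (weights, biases) pair per connection between consecutive layers.
layers = [random_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

activations = [0.2, -0.1, 0.7, 0.05]   # values held by the input layer
for weights, biases in layers:
    activations = dense_layer(activations, weights, biases)

print(activations)   # two output values from the untrained network
```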


2. For example, see Kindermans, P.-J., et al. (2018). Learning how to explain neural networks: PatternNet and 
PatternAttribution. International Conference on Learning Representations. 
Figure 4.2 Generalization of deep learning model architectures 


With each successive layer in the network being able to represent increasingly abstract, 
nonlinear recombinations of the previous layers, deep learning models with fewer than 
a dozen layers of artificial neurons are often sufficient for learning the representations 
that are of value for a given problem being solved with a given dataset. That said, deep 
learning networks with hundreds or even upwards of a thousand layers have in occasional 
circumstances been demonstrated to provide value. 3 

As rapidly improving accuracy benchmarks and countless competition wins since 
AlexNet’s 2012 victory in the ILSVRC (Figure 1.15) have demonstrated, the deep learning 
approach to modeling excels at a broad range of machine learning tasks. Indeed, with 
deep learning driving so much of the contemporary progress in AI capabilities, the words 
“deep learning” and “artificial intelligence” are used essentially interchangeably by the 
popular press. 

Let’s move inside the deep learning ring of Figure 4.1 to explore classes of tasks that 
deep learning algorithms are leveraged for: machine vision, natural language processing, 
and reinforcement learning. 

Machine Vision 

Via analogy to the biological vision system, Chapter 1 introduced machine vision. There 
we focused on object recognition tasks such as distinguishing handwritten digits or breeds 
of dogs. Other prominent examples of applications that involve machine vision algorithms 
include self-driving cars, face-tagging suggestions, and phone unlocking via face recognition 
on smartphones. More broadly, machine vision is relevant to any AI that is going 
to need to recognize objects by their appearance at a distance or navigate a real-world 
environment. 

Convolutional neural networks (ConvNets or CNNs for short) are a prominent type of 
deep learning architecture in contemporary machine vision applications. A CNN is any 
deep learning model architecture that features hidden layers of the convolutional type. We 


3. For example, see He, K., et al. (2016). Identity mappings in deep residual networks. arXiv:1603.05027. 
mentioned convolutional layers with respect to Ian Goodfellow's generative adversarial 
network results in Figure 3.2; we will detail and deploy them in Chapter 10. 

Natural Language Processing 

In Chapter 2, we covered language and natural language processing. Deep learning 
doesn’t dominate natural language applications as comprehensively as it does machine 
vision applications, so our Venn diagram in Figure 4.1 shows NLP in both the deep 
learning region as well as the broader machine learning territory. As depicted by the 
timeline in Figure 2.3, however, deep learning approaches to NLP are beginning to over- 
take traditional machine learning approaches in the field with respect to both efficiency 
and accuracy. Indeed, in particular NLP areas like voice recognition (e.g., Amazon’s 
Alexa or Google’s Assistant), machine translation (including real-time voice translation 
over the phone), and aspects of Internet search engines (like predicting the characters or 
words that will be typed next by a user), deep learning already predominates. More gen- 
erally, deep learning for NLP is relevant to any AI that interacts via natural language—be 
it spoken or typed—including to answer a complex series of questions automatically. 

A type of hidden layer that is incorporated into many deep learning architectures in 
the NLP sphere is the long short-term memory (LSTM) cell, a member of the recurrent neural 
network (RNN) family. RNNs are applicable to any data that occur in a sequence such as 
financial time series data, inventory levels, traffic, and weather. We expound on RNNs, 
including LSTMs, in Chapter 11 when we incorporate them into predictive models 
involving natural language data. These language examples provide a firm foundation 
even if you’re primarily seeking to apply deep learning techniques to the other classes of 
sequential data. 
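
To give a feel for how a recurrent layer carries information along a sequence, here is a deliberately simplified sketch: a single recurrent unit with scalar input and state, which is not an LSTM (LSTMs add gating machinery covered in Chapter 11). The weights `w_x`, `w_h`, and `b` are arbitrary assumed values rather than learned ones.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One time step: combine the new input with the previous hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0          # hidden state starts empty
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

# A single "pulse" followed by zeros: the state decays but is still nonzero,
# so the earlier input continues to influence later time steps.
print(run_rnn([1.0, 0.0, 0.0]))
```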

Three Categories of Machine Learning 
Problems 


The one remaining section of the Venn diagram in Figure 4.1 involves reinforcement 
learning, which is the focus of the rest of this chapter. To introduce reinforcement learning, 
we contrast it with the two other principal categories of problems that machine 
learning algorithms are often leveraged to tackle: supervised and unsupervised learning 
problems. 

Supervised Learning 

In supervised learning problems, we have both an x variable and a y variable, where:

■ x represents the data we’re providing as input into our model.

■ y represents an outcome we’re building a model to predict. This y variable can also
be called a label.

The goal with supervised learning is to have our model learn some function that uses x to
approximate y. Supervised learning typically involves either of two types: 

■ Regression, where our y is a continuous variable. Examples include predicting the
number of sales of a product, or predicting the future price of an asset like a
home (an example we provide in Chapter 9) or a share in an exchange-listed
company.

■ Classification, where our y-values consist of labels that assign each instance of x to
a particular category. In other words, y is a so-called categorical variable. Examples
include identifying handwritten digits (you will code up models that do this in
Chapter 10) or predicting whether someone who has reviewed a film loved it or
loathed it (as you’ll do in Chapter 11).
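
To make the idea of learning a function from x to y concrete, here is a minimal regression sketch in plain Python. It is an illustration only and not the approach used later in the book: the function names fit_line and predict and the tiny dataset are made up, and ordinary least squares stands in for whatever model you might actually train.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ w * x + b for a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def predict(w, b, x):
    """Apply the learned function to a new input."""
    return w * x + b


# Made-up training data: each x is an input, each y its continuous label.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(predict(w, b, 5.0))
```

A classification problem would follow the same pattern, except that the labels in ys would be categories rather than continuous values.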

Unsupervised Learning 

Unsupervised learning problems are distinguishable from supervised learning problems by 
the absence of a label y. Ergo, in unsupervised learning problems, we have some data x 
that we can put into a model, but we have no outcome y to predict. Rather, our goal
with unsupervised learning is to have our model discover some hidden, underlying structure
within our data. An often-used example is that of grouping news articles by their
theme. Instead of providing a predefined list of categories that the news articles belong to
(politics, sports, finance, etc.), we configure the model to group those with similar topics
for us automatically. Other examples of unsupervised learning include creating a word- 
vector space (see Chapter 2) from natural language data (you’ll do this in Chapter 11), or 
producing novel images with a generative adversarial network (as in Chapter 12). 

Reinforcement Learning 

Returning to Figure 4.1, we’re now well positioned to cover reinforcement learning problems,
which are markedly different from the supervised and unsupervised varieties. As
illustrated lightheartedly in Figure 4.3, reinforcement learning problems are ones that 
we can frame as having an agent take a sequence of actions within some environment. The 
agent could, for example, be a human or an algorithm playing an Atari video game, or 
it could be a human or an algorithm driving a car. Perhaps the primary way in which 
reinforcement learning problems diverge from supervised or unsupervised ones is that the 
actions taken by the agent influence the information that the environment provides to 
the agent—that is, the agent receives direct feedback on the actions it takes. In supervised 
or unsupervised problems, in contrast, the model never impacts the underlying data; it 
simply consumes it.


Students of deep learning often have an innate desire to divide the supervised, 
unsupervised, and reinforcement learning paradigms into the traditional machine 
learning versus deep learning approaches. More specifically, they seem to want to 
associate supervised learning with traditional machine learning while associating 
unsupervised learning or reinforcement learning (or both) with deep learning. To 
be clear, there is no such association to be made. Both traditional machine learning
and deep learning techniques can be applied to supervised, unsupervised, and
reinforcement learning problems.

Figure 4.3 The reinforcement learning loop. The top diagram is a generalized version.
The bottom diagram is specific to the example elaborated on in the text of an agent
playing a video game on an Atari console. To our knowledge, trilobites can’t actually play
video games; we’re using the trilobite as a symbolic representation of the reinforcement
learning agent, which could be either a human or a machine.

Let’s dive a bit further into the relationship between a reinforcement learning agent
and its environment by exploring some examples. In Figure 4.3, the agent is represented
by an anthropomorphized trilobite, but this agent could be either human or a machine.
Where the agent is playing an Atari video game:

■ The possible actions that can be taken are the buttons that can be pressed on the 
video game controller. 4 

■ The environment (the Atari console) returns information back to the agent. This 
information comes in two delicious flavors: state (the pixels on the screen that rep- 
resent the current condition of the environment) and reward (the point score in the 
game, which is what the agent is endeavoring to maximize via gameplay). 

■ If the agent is playing Pac-Man, then selecting the action of pressing the “up”
button results in the environment returning an updated state where the pixels representing
the video game character on the screen have moved upward. Prior to
playing any of the game, a typical reinforcement learning algorithm would not even
have knowledge of this simple relationship between the “up” button and the Pac-Man
character moving upward; everything is learned from the ground up via trial
and error.

■ If the agent selects an action that causes Pac-Man to cross paths with a pair of 
delectable cherries, then the environment will return a positive reward: an increase
in points. On the other hand, if the agent selects an action that causes Pac-Man to 
cross paths with a spooky ghost, then the environment will return a negative reward: 
a decrease in points. 
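
The loop described in these bullets (the agent acts, the environment returns a state and a reward) can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: CorridorEnv is a made-up five-cell corridor with a ghost at one end and cherries at the other, not an Atari emulator, and the agent simply picks random actions instead of learning.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..4, ghost at 0, cherries at 4."""

    ACTIONS = ("left", "right")

    def reset(self):
        self.position = 2          # start in the middle of the corridor
        return self.position       # the state the agent observes

    def step(self, action):
        self.position += 1 if action == "right" else -1
        if self.position == 4:     # cherries: positive reward, episode ends
            return self.position, 1, True
        if self.position == 0:     # ghost: negative reward, episode ends
            return self.position, -1, True
        return self.position, 0, False


def run_episode(env, choose_action):
    """Repeat the loop: choose an action, receive state and reward."""
    state = env.reset()
    total_reward, done = 0, False
    while not done:
        action = choose_action(state)
        state, reward, done = env.step(action)
        total_reward += reward
    return total_reward


rng = random.Random(0)
print(run_episode(CorridorEnv(), lambda state: rng.choice(CorridorEnv.ACTIONS)))
```

A learning agent would replace the random choice with a policy that is updated from the rewards it receives.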

In a second example, where the agent is driving a car, 

■ The available actions are much broader and richer than for Pac-Man. The agent 
can adjust the steering column, the accelerator, and the brakes to varying degrees 
ranging from subtle to dramatic. 

■ The environment in this case is the real world, consisting of roads, traffic, pedestrians, 
trees, sky, and so on. The state then is the condition of the vehicle’s surroundings, as
perceived by a human agent’s eyes and ears, or by an autonomous vehicle’s cameras
and lidar. 5

■ The reward, in the case of an algorithm, could be programmed to be positive for, say,
every meter of distance traveled toward a destination; it could be somewhat negative 
for minor traffic infractions, and severely negative in the event of a collision. 
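
As a rough sketch of how such a reward might be programmed, the function below combines the three ingredients from the last bullet. The function name driving_reward and every constant in it are illustrative assumptions, not values from any real autonomous-driving system.

```python
def driving_reward(meters_toward_destination, minor_infractions, collided):
    """Reward for one time step; all constants are illustrative assumptions."""
    if collided:
        return -1000.0                          # severely negative
    reward = 1.0 * meters_toward_destination    # positive per meter of progress
    reward -= 5.0 * minor_infractions           # somewhat negative per infraction
    return reward


print(driving_reward(meters_toward_destination=10, minor_infractions=1, collided=False))
```

The relative sizes of the constants matter more than their absolute values: the collision penalty has to dwarf anything the agent could gain by driving recklessly.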

4. We’re not aware of video game-playing algorithms that literally press the buttons on the game console’s controllers.
They would typically interact with a video game directly via a software-based emulation. We go through
the most popular open-source packages for doing this at the end of the chapter.

5. The laser-based equivalent of radar.

Deep Reinforcement Learning

At long last, we reach the deep reinforcement learning section near the center of the Venn
diagram in Figure 4.1. A reinforcement learning algorithm earns its “deep” prefix when
an artificial neural network is involved in it, such as to learn what actions to take when
presented with a given state from the environment in order to have a high probability of
obtaining a positive reward. 6 As you’ll see in the examples coming up in the next section,
the marriage of deep learning and reinforcement learning approaches has proved a
prosperous one. This is because:

■ Deep neural networks excel at processing the complex sensory input provided by 
real environments or advanced, simulated environments in order to distill relevant 
signals from a cacophony of incoming data. This is analogous to the functionality 
of the biological neurons of your brain’s visual and auditory cortexes, which receive
input from the eyes and ears, respectively. 

■ Reinforcement learning algorithms, meanwhile, shine at selecting an appropriate 
action from a vast scope of possibilities. 

Taken together, deep learning and reinforcement learning are a powerful problem-solving 
combination. Increasingly complex problems tend to require increasingly large datasets 
for deep reinforcement learning agents to wade through vast noise as well as vast randomness
in order to discover an effective policy for what actions they should take in a given
circumstance. Because many reinforcement learning problems take place in a simulated 
environment, obtaining a sufficient amount of data is often not a problem: The agent can 
simply be trained on further rounds of simulations. 

Although the theoretical foundations for deep reinforcement learning have been 
around for a couple of decades, 7 as with AlexNet for vanilla deep learning (Figure 1.17), 
deep reinforcement learning has in the past few years benefited from a confluence of 
three tail winds: 

1. Exponentially larger datasets and much richer simulated environments

2. Parallel computing across many graphics processing units (GPUs) to model efficiently
with large datasets as well as the breadth of associated possible states and
possible actions

3. A research ecosystem that bridges academia and industry, producing a quickly
developing body of new ideas on deep neural networks in general as well as on
deep reinforcement learning algorithms in particular, to, for example, identify
optimal actions across a wide variety of noisy states

6. Earlier in this chapter (see Figure 4.2), we indicate that the “deep learning” moniker applies to an artificial neural
network that has at least three hidden layers. While in general this is the case, when used by the reinforcement
learning community, the term “deep reinforcement learning” may be used even if the artificial neural network
involved in the model is shallow, that is, composed of as few as one or two hidden layers.

7. Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the Association for Computing
Machinery, 38, 58-68.

Video Games

Many readers of this book recall learning a new video game as a child. Perhaps while
at an arcade or staring at the family’s heavy cathode-ray-tube television set, you quickly
became aware that missing the ball in Pong or Breakout was an unproductive move. You
processed the visual information on the screen and, yearning for a score in excess of your
friends’, devised strategies to manipulate the controller effectively and achieve this aim.
In recent years, researchers at a firm called DeepMind have been producing software that
likewise learns how to play classic Atari games.

DeepMind was a British technology startup founded by Demis Hassabis (Figure 4.4), 
Shane Legg, and Mustafa Suleyman in London in 2010. Their stated mission was to 
“solve intelligence,” which is to say they were interested in extending the field of AI by 
developing increasingly general-purpose learning algorithms. One of their early contributions
was the introduction of deep Q-learning networks (DQNs; noted within
Figure 4.1). Via this approach, a single model architecture was able to learn to play multiple
Atari 2600 games well, from scratch, simply through trial and error.

In 2013, Volodymyr Mnih 8 and his DeepMind colleagues published 9 an article on 
their DQN agent, a deep reinforcement learning approach that you will come to understand
intimately when you construct a variant of it yourself line by line in Chapter 13.
Their agent received raw pixel values from its environment, a video game emulator, 10 as its 
state information—akin to the way human players of Atari games view a TV screen. In 
order to efficiently process this information, Mnih et al.’s DQN included a convolutional 
neural network (CNN), a common tactic for any deep reinforcement learning model that 
is fed visual data (this is why we elected to overlap “Deep RL” somewhat with “Machine
Vision” in Figure 4.1). The handling of the flood of visual input from Atari games (in 
this case, a little over two million pixels per second) underscores how well suited deep 
learning in general is to filtering out pertinent features from noise. Further, playing Atari 
games within an emulator is a problem that is well suited to deep reinforcement learning 
in particular: While they provide a rich set of possible actions that are engineered to be 
challenging to master, there is thankfully no finite limit on the amount of training data 
available because the agent can engage in endless rounds of play. 
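
The figure of a little over two million pixels per second can be checked with quick arithmetic. The sketch below assumes an Atari 2600 frame of 210 by 160 pixels refreshed 60 times per second; these dimensions are an assumption brought in for the calculation and are not stated in the passage above.

```python
FRAME_HEIGHT = 210       # assumed Atari 2600 frame height, in pixels
FRAME_WIDTH = 160        # assumed frame width, in pixels
FRAMES_PER_SECOND = 60   # assumed refresh rate

pixels_per_frame = FRAME_HEIGHT * FRAME_WIDTH
pixels_per_second = pixels_per_frame * FRAMES_PER_SECOND
print(pixels_per_second)  # 2016000
```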



Figure 4.4 Demis Hassabis cofounded DeepMind in 2010 after completing his PhD in 
cognitive neuroscience at University College London. 


8. Mnih obtained his doctorate at the University of Toronto under the supervision of Geoff Hinton (Figure 1.16). 

9. Mnih, V., et al. (2013). Playing Atari with deep reinforcement learning. arXiv:1312.5602.

10. Bellemare, M., et al. (2012). The arcade learning environment: An evaluation platform for general agents.
arXiv:1207.4708.

During training, the DeepMind DQN was not provided any hints or strategies; it was
provided only with state (screen pixels), reward (its point score, which it is programmed
to maximize), and the range of possible actions (game-controller buttons) available in a
given Atari game. The model was not altered for specific games, and yet it was able to
outperform existing machine learning approaches in six of the seven games Mnih and
his coworkers tested it on, even surpassing the performance of expert human players on
three. Perhaps influenced by this conspicuous progress, Google acquired DeepMind in
2014 for the equivalent of half a billion U.S. dollars.

In a follow-up paper published in the distinguished journal Nature, Mnih and his 
teammates at now-Google DeepMind assessed their DQN algorithm across 49 Atari 
games. 11 The results are shown in Figure 4.5: It outperformed other machine learning 
approaches on all but three of the games (94 percent of them), and, astonishingly, it
scored above human level on the majority of them (59 percent). 12 

Board Games 


It might sound sensible that board games would serve as a logical prelude to video games 
given their analog nature and their chronological head start; however, software
emulators provided a simple and easy way to interact with video games digitally,
and so the principal advances
in modern deep reinforcement learning initially took place in the realm of video games.
Additionally, relative to Atari games, the complexity of some classical board games is 
much greater. There are myriad strategies and long-plays associated with chess expertise 
that are not readily apparent in Pac-Man or Space Invaders, for example. In this section, 
we provide an overview of how deep reinforcement learning strategies mastered the board 
games Go, chess, and shogi despite the data-availability and computational-complexity 
head winds. 

AlphaGo 

Invented several millennia ago in China, Go (illustrated in Figure 4.6) is a ubiquitous
two-player strategy board game in Asia. The game has a simple set of rules based around
the idea of capturing one’s opponent’s pieces (called stones) by encircling them with one’s
own. 13 This uncomplicated premise belies intricacy in practice, however. The larger
board and the larger set of possible moves per turn make the game much more complex
than, say, chess, for which we’ve had algorithms that can defeat the best human players
for two decades. 14 There are a touch more than 2 × 10^170 possible legal board positions
in Go, which is far more than the number of atoms in the universe 15 and about a googol
(10^100) times more complex than chess.


11. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-33.

12. You can be entertained by watching the Google DeepMind DQN learn to master Space Invaders and Pong
here: bit.ly/DQNpong.

13. Indeed, Go in Chinese translates literally to “encirclement board game.”

14. IBM’s Deep Blue defeated Garry Kasparov, arguably the world’s greatest-ever chess player, in 1997. More on
that storied match coming up shortly in this section.

15. There are an estimated 10^80 atoms in the observable universe.

Figure 4.6 The Go board game. One player uses the white stones while the other uses
the black stones. The objective is to encircle the stones of your opponent, thereby
capturing them.


An algorithm called Monte Carlo tree search (MCTS) can be employed to play uncomplicated
games competently. In its purest form, MCTS involves selecting random moves 16
until the end of gameplay. By repeating this many times, moves that tended to lead to
victorious game outcomes can be weighted as favorable options. Because of the extreme
complexity and sheer number of possibilities within sophisticated games like Go, a pure
MCTS approach is impractical: There are simply too many options to search through
and evaluate. Instead of pure MCTS, an alternative approach involves MCTS applied to
a much smaller subset of actions that were curated by, for example, an established
policy of optimal play. This curated approach has proved sufficient for defeating amateur
human Go players but is uncompetitive against professionals. To bridge the gap from
amateur- to professional-level capability, David Silver (Figure 4.7) and his colleagues at
Google DeepMind devised a program called AlphaGo that combines MCTS with both
supervised learning and deep reinforcement learning. 17
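
The random-playout idea at the heart of pure MCTS can be sketched on a game far simpler than Go. The example below is a simplification under stated assumptions: it uses a made-up stone-taking game (players alternately remove one to three stones, and whoever takes the last stone wins), and it only scores the immediate candidate moves by random playouts rather than growing a full search tree. The function names are hypothetical.

```python
import random


def random_playout(stones, rng):
    """Play random moves until the pile is empty.

    Returns True if the player who moves first from `stones` takes the last stone.
    """
    first_player_to_move = True
    while True:
        stones -= rng.randint(1, min(3, stones))
        if stones == 0:
            return first_player_to_move
        first_player_to_move = not first_player_to_move


def estimate_moves(stones, playouts=2000, seed=0):
    """Estimate each legal move's win rate from many random playouts."""
    rng = random.Random(seed)
    win_rates = {}
    for take in range(1, min(3, stones) + 1):
        remaining = stones - take
        if remaining == 0:
            win_rates[take] = 1.0   # taking the last stone wins outright
            continue
        # After our move the opponent moves first; we win when they do not.
        wins = sum(not random_playout(remaining, rng) for _ in range(playouts))
        win_rates[take] = wins / playouts
    return win_rates


def best_move(stones, playouts=2000, seed=0):
    """Favor the move whose random playouts most often ended in victory."""
    win_rates = estimate_moves(stones, playouts, seed)
    return max(win_rates, key=win_rates.get)


print(estimate_moves(5))
print(best_move(5))
```

With only three candidate moves, a few thousand playouts per move are plenty. In Go the number of candidates and the length of each playout make this brute-force weighting impractical, which is the gap the curated and learned policies described above are meant to close.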


Silver et al. (2016) used supervised learning on a historical database of expert human
Go moves to establish something called a policy network, which provides a shortlist of
possible moves for a given situation. Subsequently, this policy network was refined
via self-play deep reinforcement learning, wherein both opponents are Go-playing
agents of a comparable skill level. Through this self-play, the agent iteratively improves
upon itself, and whenever it improves, it is pitted against its now-improved
self, producing a positive-feedback loop of continuous advancement. Finally, the
cherry atop the AlphaGo algorithm: a so-called value network that predicts the winner
of the self-play games, thereby evaluating positions on the board and learning to
identify strong moves. The combination of these policy and value networks (more
on both of these in Chapter 13) reduces the breadth of search space for the MCTS.


16. Hence “Monte Carlo”: The casino-dense district of Monaco evokes imagery of random outcomes.

17. Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529,
484-9.

Figure 4.7 David Silver is a Cambridge- and Alberta-educated researcher at Google
DeepMind. He has been instrumental in combining the deep learning and reinforcement
learning paradigms.


AlphaGo was able to win the vast majority of games it played against other computer-based
Go programs. Perhaps most strikingly, AlphaGo was also able to defeat Fan Hui,
the then-reigning European Go champion, five games to zero. This marked the first time
a computer defeated a professional human player in a full play of the game. As exemplified
by the Elo ratings 18 in Figure 4.8, AlphaGo performed at or above the level of the
best players in the world.
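
For readers curious how an Elo gap translates into a chance of winning, the standard Elo model gives the higher-rated player an expected score that grows with the rating difference. The sketch below uses the conventional logistic formula with a scale of 400 points; the function name and the example ratings are made up for illustration and are not taken from Figure 4.8.

```python
def elo_win_probability(rating_a, rating_b):
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Made-up ratings: equal players, then a 400-point gap in A's favor.
print(elo_win_probability(1500, 1500))
print(elo_win_probability(1900, 1500))
```

Under this formula, equally rated players each have an expected score of one half, and the two players' expected scores always sum to one.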

Following this success, AlphaGo was famously matched against Lee Sedol in March
2016 in Seoul, South Korea. Sedol has 18 world titles and is considered one of the all-time
great players. The five-game match was broadcast and viewed live by 200 million
people. AlphaGo won the match 4-1, launching DeepMind, Go, and the artificially
intelligent future into the public imagination. 19

18. Elo ratings enable the skill level of human and artificial game players alike to be compared. Derived from
calculations of head-to-head wins and losses, an individual with a higher Elo score is more likely to win a game
against an opponent with a lower score. The larger the score gap between the two players, the greater the probability
that the player with the higher score will win.

19. There is an outstanding documentary on this Sedol match that gave us chills: Kohs, G. (2017). AlphaGo. United
States: Moxie Pictures & Reel As Dirt.

Figure 4.8 The Elo score of AlphaGo (blue) relative to Fan Hui (green) and several Go
programs (red). The approximate human rank is shown on the right.

AlphaGo Zero

Following AlphaGo, the folks at DeepMind took their work further and created a
second-generation Go player: AlphaGo Zero. Recall that AlphaGo was initially trained
in a supervised manner; that is, expert human moves were used to train the network first,
and thereafter the network learned by reinforcement learning through self-play. Although
this is a nifty approach, it doesn’t exactly “solve intelligence” as DeepMind’s founders
would have liked. A better approximation of general intelligence would be a network that
could learn to play Go in a completely de novo setting, where the network is not supplied
with any human input or domain knowledge, but improves by deep reinforcement
learning alone. Enter AlphaGo Zero.

As we’ve alluded to before, the game of Go requires sophisticated look-ahead capabilities
through vast search spaces. That is, there are so many possible moves and such a tiny
fraction of them are good moves in the short- and long-play of the game that performing a
search for the optimal move, keeping the likely future state of the game in mind, becomes
exceedingly complex and computationally impractical. It is for this reason that it was
thought that Go would be a final frontier for machine intelligence; indeed, it was thought
that the achievements of AlphaGo in 2016 were a decade or more away.

Working off the momentum from the AlphaGo-Sedol match in Seoul, researchers at
DeepMind created AlphaGo Zero, which learns to play Go far beyond the level of the
original AlphaGo while being revolutionary in several ways. 20 First and foremost, it is
trained without any data from human gameplay. That means it learns purely by trial and
error. Second, it uses only the stones on the board as inputs. Contrastingly, AlphaGo had
received 15 supplementary, human-engineered features, which provided the algorithm
key hints such as how many turns since a move was played or how many opponent stones
would be captured. Third, a single (deep) neural network was used to evaluate the board
and decide on a next move, rather than separate policy and value networks (as mentioned
in the sidebar in the AlphaGo section; more on these coming in Chapter 13). Finally, the tree search
is simpler and relies on the neural network to evaluate positions and possible moves.

AlphaGo Zero played almost five million games of self-play over three days, taking
an estimated 0.4s per move to “think.” Within 36 hours, it had begun to outperform
the model that beat Lee Sedol in Seoul (retroactively termed AlphaGo Lee), which, in
stark contrast, took several months to train. At the 72-hour mark, the model was pitted
against AlphaGo Lee in match conditions, where it handily won every single one of 100
games. Even more remarkable is that AlphaGo Zero achieved this on a single machine
with four tensor processing units (TPUs) 21 whereas AlphaGo Lee was distributed over
multiple machines and used 48 TPUs. (AlphaGo Fan, which beat Fan Hui, was distributed
over 176 GPUs!) In Figure 4.9, the Elo score for AlphaGo Zero is shown over
days of training time and compared to the scores for AlphaGo Master 22 and AlphaGo Lee.
Shown on the right are the absolute Elo scores for a variety of iterations of AlphaGo and
some other Go programs. AlphaGo Zero is far and away the superior model.

20. Silver, D., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550, 354-359.

A startling discovery that emerged from this research was that the nature of the gameplay
by AlphaGo Zero is qualitatively different from that of human players and (the
human gameplay-trained) AlphaGo Lee. AlphaGo Zero began with random play but
quickly learned professional joseki: corner sequences that are considered heuristics of
distinguished play. After further training, however, the mature model tended to prefer
novel joseki that were previously unknown to humankind. AlphaGo Zero did spontaneously
learn a whole range of classical Go moves, implying a pragmatic alignment with
these techniques. However, the model did this in an original manner: It did not learn the
concept of shicho (ladder sequences), for example, until much later in its training, whereas
this is one of the first concepts taught to novice human players. The authors additionally
trained another iteration of the model with human gameplay data. This supervised model
performed better initially; however, it began to succumb to the data-free model within
the first 24 hours of training and ultimately achieved a lower Elo score. Together, these
results suggest that the data-free, self-learned model has a style of play distinct from that of
human players: a dominating style that the supervised model fails to develop.

Figure 4.9 Comparing Elo scores between AlphaGo Zero and other AlphaGo
variations or other Go programs. In the left-hand plot, the comparison is over days of
AlphaGo Zero training.

21. Google built custom processor units for training neural networks, known as tensor processing units (TPUs).
They took the existing architecture of a GPU and specifically optimized it for performing calculations that
predominate the training of neural network models. At the time of writing, TPUs were accessible to the public
via the Google Cloud Platform only.

22. AlphaGo Master is a hybrid between AlphaGo Lee and AlphaGo Zero; however, it uses the extra input features
enjoyed by AlphaGo Lee and initializes training in a supervised manner. AlphaGo Master famously played online
anonymously in January 2017 under the pseudonyms Master and Magister. It won all 60 of the games it played
against some of the world’s strongest Go players.

AlphaZero 

Having trounced the Go community, the team at DeepMind shifted their focus to gen- 
eral game-playing neural networks. Although AlphaGo Zero is adept at playing Go, they 
wondered if a comparable network could learn to play multiple games expertly. To put 
this to the test, they added two new games to their repertoire: chess and shogi. 23 

Most readers are likely familiar with the game of chess, and shogi—referred to by 
some as Japanese chess—is similar. Both games are two-player strategy games, both take 
place on a grid-format board, both culminate in a checkmate of the opponent’s king, 
and both consist of a range of pieces with different moving abilities. Shogi, however, is 
significantly more complex than chess, with a larger board size (9x9, relative to 8x8 in 
chess) and the fact that opponent pieces can be replaced anywhere on the board after their 
capture. 

Historically, artificial intelligence has had a rich interaction with the game of chess.
Over several decades, chess-playing computer programs have been developed extensively.
The most famous is Deep Blue, conceived by IBM, which went on to beat the
world champion Garry Kasparov in 1997. 24 It was heavily reliant on brute-force computing
power 25 to execute complex searches through possible moves, and combined this
with handcrafted features and domain-specific adaptations. Deep Blue was fine-tuned by
analyzing thousands of master games (it was a supervised learning system!) and it was even
tweaked between games. 26

Although Deep Blue was an achievement two decades ago, the system was not generalizable;
it could not perform any task other than chess. After AlphaGo Zero demonstrated
that the game of Go could be learned by a neural network from first principles
alone, given nothing but the board and the rules of the game, Silver and his DeepMind
colleagues set out to devise a generalist neural network, a single network architecture that
could dominate not only at Go but also at other board games.

Compared to Go, chess and shogi present pronounced obstacles. The rules of the 
games are position dependent (pieces can move differently based on where they are on 
the board) and asymmetrical (some pieces can move only in one direction). 27 Long-range 
actions are possible (such as the queen moving across the entire board), and games can
result in draws.


23. Silver, D., et al. (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. 
arXiv:1712.01815. 

24. Deep Blue lost its first match against Kasparov in 1996, and after significant upgrades went on to narrowly beat
Kasparov in 1997. This was not the total domination of man by machine that AI proponents might have hoped
for.

25. Deep Blue was the planet’s 259th most powerful supercomputer at the time of the match against Kasparov.

26. This tweaking was a point of contention between IBM and Kasparov after his loss in 1997. IBM refused
to release the program’s logs and dismantled Deep Blue. Their computer system never received an official chess
ranking, because it played so few games against rated chess masters.

27. This makes expanding the training data via synthetic augmentation (an approach used copiously for
AlphaGo) more challenging.

AlphaZero feeds the board positions into a neural network and outputs a vector of
move probabilities for each possible action, as well as a scalar 28 outcome value for that
move. The network learns the parameters for these move probabilities and outcomes
entirely from self-play deep reinforcement learning, as AlphaGo Zero did. An MCTS is
then performed on the reduced space guided by these probabilities, returning a refined
vector of probabilities over the possible moves. Whereas AlphaGo Zero optimizes the
probability of winning (Go is a binary win/loss game), AlphaZero instead optimizes for
the expected outcome. During self-play, AlphaGo Zero retains the best player to date and
evaluates updated versions of itself against that player, continually replacing the player
with the next best version. AlphaZero, in contrast, maintains a single network and at any
given time is playing against the latest version of itself. AlphaZero was trained to play each
of chess, shogi, and Go for a mere 24 hours. There were no game-specific modifications,
with the exception of a manually configured parameter that regulates how frequently the
model takes random, exploratory moves; this was scaled to the number of legal moves in
each game. 29
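
To give a feel for what an exploration parameter does, here is a generic epsilon-greedy sketch: with probability epsilon the agent plays a uniformly random move, and otherwise it plays the move its probability vector rates highest. This is a textbook illustration under stated assumptions, not a reproduction of AlphaZero's exploration mechanism; the function name choose_move and the three-move probability vector are made up.

```python
import random


def choose_move(move_probabilities, epsilon, rng):
    """Pick a move index from a vector of move probabilities.

    With probability `epsilon` a uniformly random move is explored;
    otherwise the most probable move is played.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(move_probabilities))
    return max(range(len(move_probabilities)), key=lambda i: move_probabilities[i])


rng = random.Random(0)
policy = [0.1, 0.6, 0.3]   # made-up network output for three legal moves
print(choose_move(policy, epsilon=0.0, rng=rng))  # no exploration: index 1
print(choose_move(policy, epsilon=0.25, rng=rng))
```

Raising epsilon trades some short-term strength for a broader sampling of moves, which matters more in games where many legal moves are available.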

Across 100 competitive games, AlphaZero did not lose a single one against the 2016
Top Chess Engine Championship world champion, Stockfish. In shogi, the Computer
Shogi Association world champion, Elmo, managed to beat AlphaZero only eight times
in 100 games. Its perhaps most worthy opponent, AlphaGo Zero, was able to defeat
AlphaZero in 40 of their 100 games. Figure 4.10 shows the Elo scores for AlphaZero
relative to these three adversaries.

Not only was AlphaZero superior, it was also efficient. AlphaZero’s Elo score 
exceeded its greatest foes’ after only two, four, and eight hours of training for shogi, 
chess, and Go, respectively. This is a sensationally rapid rate of learning, considering 
that in the case of Elmo and Stockfish, these computer programs represent the culmina- 
tion of decades of research and fine-tuning in a focused, domain-specific manner. The 
generalizable AlphaZero algorithm is able to play all three games with aplomb: Simply 
switching out learned weights from otherwise identical neural network architectures 
imbues each with the same skills that have taken years to develop by other means. These 
results demonstrate that deep reinforcement learning is a strikingly powerful approach for 
developing general expert gameplay in an undirected fashion. 




[Plot panels: x-axis "Thousands of Steps"; legend: AlphaZero, AlphaGo Zero, AlphaGo Lee] 


Figure 4.10 Comparing Elo scores between AlphaZero and each of its opponents in 
chess, shogi, and Go. AlphaZero rapidly outperformed all three opponents. 


28. A single value. 

29. This manually configured exploration parameter is called epsilon. It is detailed in Chapter 13. 
Manipulation of Objects 

As this chapter’s title might suggest, thus far we’ve centered our coverage of deep 
reinforcement learning on its game-playing applications. Although games offer a hot testbed 
for exploring the generalization of machine intelligence, in this section we spend a 
few moments expounding on practical, real-world applications of deep reinforcement 
learning. 

One real-world example we mention earlier in this chapter is autonomous vehicles. 

As an additional example, here we provide an overview of research by Sergey Levine, 
Chelsea Finn (Figure 4.11), and labmates at the University of California, Berkeley. 30 
These researchers trained a robot to perform a number of motor skills that require 
complex visual understanding and depth perception, such as screwing the cap back onto a 
bottle, removing a nail with a toy hammer, placing a hanger on a rack, and inserting a 
cube in a shape-fitting game (Figure 4.12). 

Levine, Finn, and colleagues’ algorithm maps raw visual input directly to the 
movement of the motors in the robot’s arm. Their policy network was a seven-layer-deep 
convolutional neural network (CNN) consisting of fewer than 100,000 artificial 
neurons—a minuscule amount in deep learning terms, as you’ll see when you train 
orders-of-magnitude larger networks later in this book. Although it would be tricky to 
elaborate further on this approach before we delve much into artificial-neural-network 
theory (in Part II, which is just around the corner), there are three take-away points we’d 
like to highlight on this elegant practical application of deep reinforcement learning. First, 
it is an “end-to-end” deep learning model in the sense that the model takes in raw 
images (pixels) as inputs and then outputs directly to the robot’s motors. Second, the model 



Figure 4.11 Chelsea Finn is a doctoral candidate at the University of California, 
Berkeley, in its AI Research Lab. 


30. Levine, S., Finn, C., et al. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning 
Research, 17, 1–40. 
(a) hanger (b) cube (c) hammer (d) bottle 

Figure 4.12 Sample images from Levine, Finn, et al. (2016) exhibiting various 
object-manipulation actions the robot was trained to perform 


generalizes neatly to a broad range of unique object-manipulation tasks. Third, it is an 
example of the policy gradient family of deep reinforcement learning approaches, rounding 
out the terms featured in the Venn diagram in Figure 4.1. Policy gradient methods are 
distinct from the DQN approach that is the focus of Chapter 13, but we touch on them 
then too. 

Popular Deep Reinforcement Learning 
Environments 


Over the past few sections, we talk a fair bit about software emulation of environments 
in which to train reinforcement learning models. This area of development is crucial 
to the ongoing progression of reinforcement learning; without environments in which 
our agents can play and explore (and gather data!), there would be no training of 
models. Here we introduce the three most popular environments, discussing their high-level 
attributes. 

OpenAI Gym 

OpenAI Gym 31 is developed by the nonprofit AI research company OpenAI. 32 The 
mission of OpenAI is to advance artificial general intelligence (more on that in the next 
section!) in a safe and equitable manner. To that end, the researchers at OpenAI have 
produced and open-sourced a number of tools for AI research, including the OpenAI 


31. github.com/openai/gym 

32. openai.com 
Gym. This toolkit is designed to provide an interface for training reinforcement learning 
models, be they deep or otherwise. 

As captured in Figure 4.13, the Gym features a diverse array of environments, includ- 
ing a number of Atari 2600 games, 33 multiple robotics simulators, a few simple text-based 
algorithmic games, and several robotics simulations using the MuJoCo physics engine. 34 
In Chapter 13, you’ll install OpenAI Gym in a single line of code and then employ an 
environment it provides to train the DQN agent that you build. OpenAI Gym is writ- 
ten in Python and is compatible with any deep learning computation library, including 
TensorFlow and PyTorch (we discuss the various deep learning libraries in Chapter 14; 
these are two particularly popular ones). 
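
To make the idea of an environment interface concrete before Chapter 13, here is a minimal sketch of the agent-environment loop. It deliberately does not import Gym: `CoinFlipEnv` is a hypothetical toy environment of our own invention that only mimics the general reset-then-step pattern, and the exact method signatures of the real library differ between Gym versions.

```python
import random

class CoinFlipEnv:
    """Hypothetical toy environment (not part of Gym): guess a coin flip each step."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps_taken = 0

    def reset(self):
        # Start a new episode and return an initial observation.
        self.steps_taken = 0
        return 0

    def step(self, action):
        # The reward is 1 when the action (0 or 1) matches the hidden coin flip.
        coin = self.rng.randint(0, 1)
        reward = 1 if action == coin else 0
        self.steps_taken += 1
        done = self.steps_taken >= self.episode_length
        observation = coin  # the agent is shown the flip it just tried to guess
        return observation, reward, done

env = CoinFlipEnv()
agent_rng = random.Random(1)
observation = env.reset()
total_reward, done = 0, False
while not done:
    action = agent_rng.randint(0, 1)  # a random agent; a trained agent would choose here
    observation, reward, done = env.step(action)
    total_reward += reward
print("Total reward over one episode:", total_reward)
```

The loop is the part worth remembering: the agent observes, acts, receives a reward, and repeats until the episode ends.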

DeepMind Lab 

DeepMind Lab 35 is another RL environment, this time from the developers at Google 
DeepMind (although they point out that DeepMind Lab is not an official Google product). 
As can be seen in Figure 4.14, the environment is built on top of id Software’s 
Quake III Arena 36 and provides a sci-fi inspired three-dimensional world in which agents 
can explore. The agent experiences the environment from the first-person perspective, 
which is distinct from the Atari emulators available via OpenAI Gym. 

There are a variety of levels available, which can be roughly divided into four 
categories: 

1. Fruit-gathering levels, where the agent simply tries to find and collect rewards 
(apples and melons) while avoiding penalties (lemons). 

2. Navigation levels with a static map, where the agent is tasked with finding a goal 
and remembering the layout of the map. The agent can either be randomly placed 
within a map at the start of each episode while the goal remains stationary, an 
arrangement that tests initial exploration followed by a reliance on memory to 
repeatedly find the goal; or the agent can start in the same place while the goal is 
moved for every episode, testing the agent’s ability to explore. 

3. Navigation levels with random maps, where the agent is required to explore a novel 
map in each episode, find the goal, and then repeatedly return to the goal as many 
times as possible within a time limit. 

4. Laser-tag levels, where the agent is rewarded for hunting and attacking bots in an 
array of different scenes. 

Installation of DeepMind Lab is not as straightforward as OpenAI Gym, 37 but it 
provides a rich, dynamic first-person environment in which to train agents, and the levels 


33. OpenAI Gym uses the Arcade Learning Environment to emulate Atari 2600 games. This same framework is 
used in the Mnih et al. (2013) paper described in the “Video Games” section. You can find the framework yourself 
at github.com/mgbellemare/Arcade-Learning-Environment. 

34. MuJoCo is an abbreviation of Multi-Joint dynamics with Contact. It is a physics engine that was developed by 
Emo Todorov for Roboti LLC. 

35. Beattie, C. et al. (2016). DeepMind Lab. arXiv:1612.03801. 

36. Quake III Arena. (1999). United States: id Software. github.com/id-Software/Quake-III-Arena 

37. First the GitHub repository (github.com/deepmind/lab) is cloned, and then the software must be built using 
Bazel (bit.ly/installB). The DeepMind Lab repository provides detailed instructions (bit.ly/buildDML). 
Figure 4.13 A sampling of OpenAI Gym environments: (a) CartPole, a classic 
control-theory problem; (b) LunarLander, a continuous-control task run inside a 
two-dimensional simulation; (c) Skiing, an Atari 2600 game; (d) Humanoid, a 
three-dimensional MuJoCo physics engine simulation of a bipedal person; 
(e) FetchPickAndPlace, one of several available simulations of real-world robot arms, in 
this case involving one called Fetch, with the goal of grasping a block and placing it in a 
target location; and (f) HandManipulateBlock, another practical simulation of a robotic 
arm, the Shadow Dexterous Hand. 
Figure 4.14 A DeepMind Lab environment, in which positive-reward points are 
awarded for capturing scrumptious green apples 


provide complex scenarios involving navigation, memory, strategy, planning, and fine- 
motor skills. These challenging environments can test the limits of what is tractable with 
contemporary deep reinforcement learning. 

Unity ML-Agents 

Unity is a sophisticated engine for two- and three-dimensional video games and digital 
simulations. Given the game-playing proficiency of reinforcement learning algorithms we 
chronicle earlier in this chapter, it may come as no surprise that the makers of a popular 
game engine are also in the business of providing environments to incorporate 
reinforcement learning into video games. The Unity ML-Agents plug-in 38 enables reinforcement 
learning models to be trained within Unity-based video games or simulations and, per- 
haps more fitting with the purpose of Unity itself, allows reinforcement learning models 
to guide the actions of agents within the game. 

As with DeepMind Lab, installation of Unity ML-Agents is not a one-liner. 39 

Three Categories of AI 

Of all deep learning topics, deep reinforcement learning is perhaps the one most closely 
tied to the popular perception of artificial intelligence as a system for replicating the 
cognitive, decision-making capacity of humans. In light of that, to wrap up this chapter, in 
this section we introduce three categories of AI. 


38. github.com/Unity-Technologies/ml-agents 

39. It requires the user to first install Unity (for download and installation instructions, see 
store.unity.com/download) and then clone the GitHub repository. Full instructions are available at the Unity ML-Agents 
GitHub repository (bit.ly/MLagents). 
Artificial Narrow Intelligence 

Artificial narrow intelligence (ANI) is machine expertise at a specific task. Many diverse 
examples of ANI exist today, and we’ve mentioned plenty already in this book, such as 
the visual recognition of objects, real-time machine translation between natural languages, 
automated financial-trading systems, AlphaZero, and self-driving cars. 

Artificial General Intelligence 

Artificial general intelligence (AGI) would involve a single algorithm that could perform well 
at all of the tasks described in the preceding paragraph: It would be able to recognize your 
face, translate this book into another language, optimize your investment portfolio, beat 
you at Go, and drive you safely to your holiday destination. Indeed, such an algorithm 
would be approximately indistinguishable from the intellectual capabilities of an individual 
human. There are countless hurdles to overcome in order for AGI to be realized; it is 
challenging to approximate when it will be achieved, if it will be achieved at all. That 
said, AI experts are happy to wave a finger in the air and speculate on timing. In a study 
conducted by the philosopher Vincent Müller and the influential futurist Nick Bostrom, 40 
the median estimate across hundreds of professional AI researchers is that AGI will be 
attained in the year 2040. 

Artificial Super Intelligence 

Artificial super intelligence (ASI) is difficult to describe because it’s properly mind-boggling. 
ASI would be an algorithm that is markedly more advanced than the intellectual 
capabilities of a human. 41 If AGI is possible, then ASI may be as well. Of course, there are 
even more hurdles on the road to ASI than to AGI, the bulk of which we can’t foresee 
today. Citing the Müller and Bostrom survey again, however, AI experts’ median estimate 
for the arrival of ASI is 2060, a rather hypothetical date that falls within the life-span of 
many earthlings alive today. In Chapter 14, at which point you’ll be well-versed in deep 
learning both in theory and in practice, we discuss both how deep learning models could 
contribute to AGI as well as the present limitations associated with deep learning that 
would need to be bridged in order to attain AGI or, gasp, ASI. 

Summary 

The chapter began with an overview relating deep learning to the broader field of 
artificial intelligence. We then detailed deep reinforcement learning, an approach that blends 
deep learning with the feedback-providing reinforcement learning paradigm. As discussed 
via real-world examples ranging from the board game Go to the grasping of physical 
objects, such deep reinforcement learning enables machines to process vast amounts of 
data and take sensible sequences of actions on complex tasks, associating it with popular 
conceptions of AI. 


40. Müller, V., and Bostrom, N. (2014). Future progress in artificial intelligence: A survey of expert opinion. In 
V. Müller (Ed.), Fundamental Issues of Artificial Intelligence. Berlin: Springer. 

41. In 2015, the writer and illustrator Tim Urban provided a two-part series of posts that rivetingly covers ASI 
and the related literature. It’s available at bit.ly/urbanAI for you to enjoy. 
The (Code) Cart Ahead of the (Theory) Horse 


In Part I, we provided a high-level overview of deep learning by demonstrating its use 
across a spectrum of cutting-edge applications. Along the way, we sprinkled in founda- 
tional deep learning concepts from its hierarchical, representation-learning nature through 
to its relationship to the field of artificial intelligence. Repeatedly, as we touched on a 
concept, we noted that in Part II of the book we would dive into the low-level theory 
and mathematics behind it. While we promise this is true, we are going to take this final 
opportunity to put the fun, hands-on coding cart ahead of the proverbial—in this case, 
theory-laden—horse. 

In this chapter we do a line-by-line walk-through of a notebook of code featuring 
a neural network model. While you will need to bear with us because we have not yet 
detailed much of the theory underpinning the code, this serpentine approach will make 
the apprehension of theory in the subsequent chapters easier: Instead of being an abstract 
idea, each element of theory we introduce in this part of the book will be rooted in a 
tangible line of applied code. 

Prerequisites 

Working through the examples in this book will be easiest if you are familiar with the 
basics of the Unix command line. These are provided by Zed Shaw in Appendix A of his 
deceptively enjoyable Learn Python the Hard Way. 1 

Speaking of Python, since it is comfortably the most popular software language in the 
data science community (at time of writing, anyway), it’s the language we selected for our 
example code throughout the book. Python’s prevalence extends across the composition 
of stand-alone scripts through to the deployment of machine learning models into 
production systems. If you’re new to Python or you’re feeling a tad rusty, Shaw’s book serves 


1. Shaw, Z. (2013). Learn Python the Hard Way, 3rd Ed. New York: Addison-Wesley. This relevant appendix, Shaw’s 
“Command Line Crash Course,” is available online at learnpythonthehardway.org/book/appendixa.html. 
as an appropriate general reference, while Daniel Chen’s Pandas for Everyone 2 is ideal for 
learning how to apply the language to data modeling in particular. 

Installation 


Regardless of whether you’re planning on executing our code notebooks via Unix, 
Linux, macOS, or Windows, we have made step-by-step installation instructions available 
in the GitHub repository that accompanies this book: 

github.com/the-deep-learners/deep-learning-illustrated 

If you’d prefer to view the completed notebooks instead of running them on your own 
machine, you are more than welcome to do that from the GitHub repo as well. 

We elected to provide our code within the comfort of interactive Jupyter notebooks. 3 
Jupyter is a common option today for writing and sharing scripts, particularly during 
exploratory phases in which a data scientist is experimenting with preprocessing, visualizing, 
and modeling her data. Our installation instructions suggest running Jupyter from within 
a Docker container. 4 This containerization ensures that you’ll have all of the software 
dependencies you need to run our notebooks while simultaneously preventing these 
dependencies from clashing with software you already have installed on your system. 

A Shallow Network in Keras 

To kick off the code portion of our book, we will: 

1. Detail a revered dataset of handwritten digits 

2. Load these data into a Jupyter notebook 

3. Use Python to prepare the data for modeling 

4. Write a few lines of code in Keras, a high-level deep learning API, to construct an 
artificial neural network (in TensorFlow, behind the scenes) that predicts what digit 
a given handwritten sample represents 

The MNIST Handwritten Digits 

Back in Chapter 1 when we introduced the LeNet-5 machine vision architecture 
(Figure 1.11), we mentioned that one of the advantages Yann LeCun (Figure 1.9) and 
his colleagues had over previous deep learning practitioners was a superior dataset for 
training their model. This dataset of handwritten digits, called MNIST (see the 
samples in Figure 5.1), came up again in the context of being imitated by Ian Goodfellow’s 
generative adversarial network (Figure 3.2a). The MNIST dataset is ubiquitous across 


2. Chen, D. (2017). Pandas for Everyone: Python Data Analysis. New York: Addison-Wesley. 

3. jupyter.org. We recommend familiarizing yourself with the hot keys to breeze through Jupyter notebooks 
with pizzazz. 

4. docker.com 
Figure 5.1 A sample of a dozen images from the MNIST dataset. Each image contains 
a single digit handwritten by either a high school student or a U.S. census worker. 


deep learning tutorials, and for good reason. By modern standards, the dataset is small 
enough that it can be modeled rapidly, even on a laptop computer processor. In addition 
to their portable size, the MNIST digits are handy because they occupy a sweet spot with 
respect to how challenging they are to classify: The handwriting samples are sufficiently 
diverse and contain complex enough details that they are not easy for a machine-learning 
algorithm to identify with high accuracy, and yet by no means do they pose an insur- 
mountable problem. However, as you will observe yourself as we make our way through 
Part II of this book, a well-designed deep-learning model can nearly faultlessly classify the 
handwriting as the appropriate digit. 

The MNIST dataset was curated by LeCun (Figure 1.9), Corinna Cortes (Figure 5.2), 
and the Microsoft-AI-researcher-turned-musician Chris Burges in the 1990s. 5 It consists 
of 60,000 handwritten digits for training an algorithm, and 10,000 more for validating the 
algorithm’s performance on previously unseen data. The data are a subset (a modification) 
of a larger body of handwriting samples collected from high school students and census 
workers by the U.S. National Institute of Standards and Technology (NIST). 

As exemplified by Figure 5.3, every MNIST digit is a 28x28-pixel image. 6 Each pixel 
is 8-bit, meaning that the pixel darkness can vary from 0 (white) to 255 (black), with the 
intervening range of integers representing gradually darker shades of gray. 

A Schematic Diagram of the Network 

In our Shallow Net in Keras Jupyter notebook, 7 we create an artificial neural network to 
predict what digit a given handwritten MNIST image represents. As shown in the rough 
schematic diagram in Figure 5.4, this artificial neural network features one hidden layer 
of artificial neurons, for a total of three layers. Recalling Figure 4.2, with so few layers 


5. yann.lecun.com/exdb/mnist/ 

6. Python uses zero indexing, so the first row and column are denoted with 0. The 28th row and 28th column of 
pixels are therefore both denoted with 27. 

7. Within this book’s GitHub repository, navigate into the notebooks directory. 
Figure 5.2 The Danish computer scientist Corinna Cortes is head of research at 
Google’s New York office. Among her countless contributions to both pure and applied 
machine learning, Cortes (with Chris Burges and Yann LeCun) curated the widely used 
MNIST dataset. 



Figure 5.3 Each handwritten MNIST digit is stored as a 28x28-pixel grayscale image. 
See the Jupyter notebook titled MNIST Digit Pixel by Pixel that accompanies this book for 
the code we used to create this figure. 


this ANN would not generally be considered a deep learning architecture; hence it is 
shallow. 

The first layer of the network is reserved for inputting our MNIST digits. Because 
they are 28x28-pixel images, each one has a total of 784 values. After we load in the 
images, we’ll flatten them from their native, two-dimensional 28x28 shape to a 
one-dimensional array of 784 elements. 
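
As a small preview of what flattening does, the following NumPy sketch reshapes a stand-in 28x28 array (filled with the integers 0 to 783 rather than real pixel values) into 784 elements. The array contents and variable names are ours, chosen only so that the reordering is easy to follow.

```python
import numpy as np

# A stand-in 28x28 "image" holding the integers 0..783, not a real MNIST digit.
image = np.arange(28 * 28).reshape(28, 28)

# Flatten the two-dimensional grid into a one-dimensional array of 784 values.
flat = image.reshape(784)
print(image.shape, flat.shape)

# Rows are laid end to end, so pixel (row, col) lands at index row * 28 + col.
row, col = 3, 5
print(flat[row * 28 + col] == image[row, col])
```

Every pixel value survives the reshape; what is lost is the explicit record of which pixels sat above and below one another.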
[Diagram: input layer, hidden layer, output layer] 



Figure 5.4 A rough schematic of the shallow artificial-neural-network architecture we’re 
whipping up in this chapter. We detail the particular sigmoid and softmax flavors of 
artificial neurons in Chapters 6 and 7, respectively. 


You could argue that collapsing the images from two dimensions to one will cause 
us to lose a lot of the meaningful structure of the handwritten digits. Well, if you 
argued that, you’d be right! Working with one-dimensional data, however, means 
we can use relatively unsophisticated neural network models, which is appropriate 
at this early stage in our journey. Later, in Chapter 10, you’ll be in a position to 
appreciate more-complex models that can handle multidimensional inputs. 


The pixel-data inputs will be passed through a single, hidden layer of 64 artificial 
neurons. 8 The number (64) and type (sigmoid) of these neurons aren’t critical details at 
present; we begin to explain these model attributes in the next chapter. The key piece 
of information at this time is that, as we demonstrate in Chapter 1 (see Figures 1.18 and 
1.19), the neurons in the hidden layer are responsible for learning representations of the 
input data so that the network can predict what digit a given image represents. 

Finally, the information that is produced by the hidden layer will be passed to 10 
neurons in the output layer. We detail how a softmax layer of neurons works in Chapter 7, 
but, in essence, we have 10 neurons because we have 10 categories of digit to classify. 
Each of these 10 neurons outputs a probability: one for each of the 10 possible digits that 
a given MNIST image could represent. As an example, a fairly well-trained network that 
is fed the image in Figure 5.3 might output that there is a 0.92 probability that the 
image is of a three, a 0.06 probability that it’s a two, a 0.02 probability that it’s an eight, and a 
probability of 0 for the other seven digits. 
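
Chapter 7 covers softmax properly, but a rough sketch may help show how 10 output values can behave like probabilities. The code below uses the standard softmax definition in NumPy; the raw scores are invented for illustration and do not come from a trained network.

```python
import numpy as np

def softmax(z):
    """Convert a vector of raw scores into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtracting the maximum avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

# Made-up raw scores for the digits 0 through 9; the score at index 3 is the largest.
scores = np.array([0.0, 0.0, 2.3, 5.0, 0.0, 0.0, 0.0, 0.0, 1.2, 0.0])
probabilities = softmax(scores)
predicted_digit = int(np.argmax(probabilities))
print(predicted_digit, probabilities.round(2))
```

The digit with the largest raw score receives the largest probability, and the 10 probabilities sum to one.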

Loading the Data 

At the top of the notebook we import our software dependencies, which is the 
unexciting but necessary step shown in Example 5.1. 


8. “Hidden” layers are so called because they are not exposed; data impact them only indirectly, via the input layer 
or the output layer of neurons. 
Example 5.1 Software dependencies for shallow net in Keras 

import keras 

from keras.datasets import mnist 
from keras.models import Sequential 
from keras.layers import Dense 
from keras.optimizers import SGD 
from matplotlib import pyplot as plt 

We import Keras because that’s the library we’re using to fashion our neural network. We 
also import the MNIST dataset because these, of course, are the data we’re working with 
in this example. The lines ending in Sequential, Dense, and SGD will make sense later; 
no need to worry about them at this stage. Finally, the matplotlib line will enable us to 
plot MNIST digits to our screen. 

With these dependencies imported, we can conveniently load the MNIST data in a 
single line of code, as in Example 5.2. 

Example 5.2 Loading MNIST data 

(X_train, y_train), (X_valid, y_valid) = mnist.load_data() 

Let’s examine these data. As mentioned in Chapter 4, the mathematical notation x is used 
to represent the data we’re feeding into a model as input, while y is used for the labeled 
output that we’re training the model to predict. With this in mind, X_train stores the 
MNIST digits we’ll be training our model on. 9 Executing X_train.shape yields the 
output (60000, 28, 28). This shows us that, as expected, we have 60,000 images in our 
training dataset, each of which is a 28x28 matrix of values. Running y_train.shape, 
we unsurprisingly discover we have 60,000 labels indicating what digit is contained 
in each of the 60,000 training images. y_train[0:12] outputs an array of 12 integers 
representing the first dozen labels, so we can see that the first handwritten digit in the 
training set (X_train[0]) is the number five, the second is a zero, the third is a four, 
and so on. 

array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5], dtype=uint8) 

These happen to be the same dozen MNIST digits that were shown earlier in Figure 5.1, 
a figure we created by running the following chunk of code: 

plt.figure(figsize=(5,5)) 
for k in range(12): 
    plt.subplot(3, 4, k+1) 


9. The convention is to use an uppercase letter like X when the variable being represented is a two-dimensional 
matrix or a data structure with even higher dimensionality. In contrast, a lowercase letter like x is used to represent 
a single value (a scalar) or a one-dimensional array. 
Figure 5.5 The first MNIST digit in the validation dataset (X_valid[0]) is a seven. 



    plt.imshow(X_train[k], cmap='Greys') 
    plt.axis('off') 
plt.tight_layout() 
plt.show() 


Akin to the training data, by examining the shape of the validation data 
(X_valid.shape, y_valid.shape), we note that there are the expected 10,000 
28x28-pixel validation images, each with a corresponding label: (10000, 28, 28), (10000,). 
Investigating the values that make up an individual image such as X_valid[0], we 
observe that the matrix of integers representing the handwriting is primarily zeros 
(whitespace). Tilting your head, you might even be able to make out that the digit in this 
example is a seven with the highest integers (e.g., 254, 255) representing the black core of 
the handwritten figure, and the outline of the figure (composed of intermediate integers) 
fading toward white. To corroborate that this is indeed the number seven, we both printed 
out the image using plt.imshow(X_valid[0], cmap='Greys') (output shown in 
Figure 5.5) and printed out its label using y_valid[0] (output was 7). 

Reformatting the Data 

The MNIST data now loaded, we come across the heading “Preprocess data” in the 
notebook. We won’t, however, be preprocessing the images by applying functions to, say, 
extract features that provide hints to our artificial neural network. Instead, we will simply 
be rearranging the shape of the data so that they match up with the shapes of the input 
and output layers of the network. 

Thus, we’ll flatten our 28x28-pixel images into 784-element arrays. We employ the 
reshape() method, as shown in Example 5.3. 




Example 5.3 Flattening two-dimensional images to one dimension 

X_train = X_train.reshape(60000, 784).astype('float32') 
X_valid = X_valid.reshape(10000, 784).astype('float32') 

Simultaneously, we use astype('float32') to convert the pixel darknesses from integers 
into single-precision float values. 10 This conversion is preparation for the subsequent step, 
shown in Example 5.4, in which we divide all of the values by 255 so that they range 
from 0 to 1. 11 

Example 5.4 Converting pixel integers to floats 

X_train /= 255 
X_valid /= 255 

Revisiting our example handwritten seven from Figure 5.5 by running X_valid[0], we 
can verify that it is now represented by a one-dimensional array made up of float values as 
low as 0 and as high as 1. 
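
To see why the float conversion has to come before the division, consider the small NumPy sketch below. It uses four stand-in pixel values instead of the MNIST arrays, and the variable names are ours. The in-place division on the integer array is expected to fail because the fractional results cannot be stored back into an 8-bit integer array.

```python
import numpy as np

# Stand-in pixel values rather than real MNIST data.
pixels = np.array([0, 127, 254, 255], dtype=np.uint8)

try:
    as_int = pixels.copy()
    as_int /= 255  # in-place true division cannot write floats into a uint8 array
    in_place_failed = False
except TypeError:
    in_place_failed = True

# Converting to 32-bit floats first lets the in-place division succeed.
scaled = pixels.astype('float32')
scaled /= 255
print(in_place_failed, scaled.dtype, scaled.min(), scaled.max())
```

After the conversion, the smallest value is 0 and the largest is 1, which is the range the model inputs are scaled to in Example 5.4.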

That’s all for reformatting our model inputs X. As shown in Example 5.5, for the 
labels y, we need to convert them from integers into one-hot encodings (shortly we 
demonstrate what these are via a hands-on example). 

Example 5.5 Converting integer labels to one-hot 

n_classes = 10 

y_train = keras.utils.to_categorical(y_train, n_classes) 
y_valid = keras.utils.to_categorical(y_valid, n_classes) 

There are 10 possible handwritten digits, so we set n_classes equal to 10. In the other 
two lines of code we use a convenient utility function—to_categorical, which is 
provided within the Keras library—to transform both the training and the validation labels 
from integers into the one-hot format. Execute y_valid[0] to see how the label seven is 
represented now: 

array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0.], dtype=float32) 

Instead of using an integer to represent seven, we have an array of length 10 consisting 
entirely of 0s, with the exception of a 1 in the eighth position. In such a one-hot encoding, 
the label zero would be represented by a lone 1 in the first position, one by a lone 1 in 


10. The data are initially stored as uint8, which is an unsigned integer from 0 to 255. This is more memory 
efficient, but it doesn’t require much precision because there are only 256 possible values. Without specifying, Python 
would default to a 64-bit float, which would be overkill. Thus, by specifying a 32-bit float we can deliberately 
specify a lower-precision float that is sufficient for this use case. 

11. Machine learning models tend to learn more efficiently when fed standardized inputs. Binary inputs would 
typically be a 0 or a 1, whereas distributions are often normalized to have a mean of 0 and a standard deviation of 
1. As we’ve done here, pixel intensities are generally scaled to range from 0 to 1. 
the second position, and so on. We arrange the labels with such one-hot encodings so 
that they line up with the 10 probabilities being output by the final layer of our 
artificial neural network. They represent the ideal output that we are striving to attain with 
our network: If the input image is a handwritten seven, then a perfectly trained network 
would output a probability of 1.00 that it is a seven and a probability of 0.00 for each of 
the other nine classes of digits. 
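
If you are curious what such a conversion involves, here is a hand-rolled NumPy sketch of one-hot encoding. The helper name `one_hot` is our own and is not a Keras function; it is meant only to illustrate the idea, not to reproduce how the Keras utility is implemented.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (len(labels), n_classes) float32 array with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, n_classes), dtype='float32')
    # For row i, place a 1 in the column given by the i-th label.
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

encoded = one_hot([7, 0, 1], n_classes=10)
print(encoded[0])
```

Taking the position of the 1 in each row (for instance with `np.argmax(encoded, axis=1)`) recovers the original integer labels.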

Designing a Neural Network Architecture 

From your authors’ perspective, this is the most pleasurable bit of any script featuring 
deep learning code: architecting the artificial neural net itself. There are infinite 
possibilities here, and, as you progress through the book, you will begin to develop an intuition 
that guides the selection of the architectures you might experiment with for tackling a 
given problem. Referring to Figure 5.4, for the time being, we’re keeping the 
architecture as elementary as possible in Example 5.6. 

Example 5.6 Keras code to architect a shallow neural network 

model = Sequential() 
model.add(Dense(64, activation='sigmoid', input_shape=(784,))) 
model.add(Dense(10, activation='softmax')) 

In the first line of code, we instantiate the simplest type of neural network model object,
the Sequential type 12 and—in a dash of extreme creativity—name the model model. In
the second line, we use the add() method of our model object to specify the attributes
of our network’s hidden layer (64 sigmoid-type artificial neurons in the general-purpose,
fully connected arrangement defined by the Dense() method) 13 as well as the shape of
our input layer (one-dimensional array of length 784). In the third and final line we use
the add() method again to specify the output layer and its parameters: 10 artificial
neurons of the softmax variety, corresponding to the 10 probabilities (one for each of the 10
possible digits) that the network will output when fed a given handwritten image.
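
To get a feel for the size of this architecture, the following sketch counts its trainable parameters by hand. It assumes that every neuron in a dense layer has one weight per input plus a single bias; the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(n_inputs, n_neurons):
    """Weights (one per input per neuron) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons

hidden_params = dense_param_count(784, 64)   # 784 pixels feeding 64 sigmoid neurons
output_params = dense_param_count(64, 10)    # 64 activations feeding 10 softmax neurons
total_params = hidden_params + output_params

print(hidden_params, output_params, total_params)
```

Even this shallow network therefore has tens of thousands of adjustable values, nearly all of them in the hidden layer.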

Training a Deep Learning Model 

Later, we return to the model.summary() and model.compile() steps of the Shallow Net
in Keras notebook, as well as its three lines of arithmetic. For now, we skip ahead to the
model-fitting step (shown in Example 5.7).

Example 5.7 Keras code to train our shallow neural network 

model.fit(X_train, y_train,
          batch_size=128, epochs=200,
          verbose=1,
          validation_data=(X_valid, y_valid))


12. So named because each layer in the network passes information to only the next layer in the sequence of layers. 

13. Once more, these esoteric terms will become comprehensible over the coming chapters. 
The critical aspects are:

1. The fit() method of our model object enables us to train our artificial neural
network with the training images X_train as inputs and their associated labels y_train
as the desired outputs.

2. As the network trains, the fit() method also provides us with the option to
evaluate the performance of our network by passing our validation data X_valid and
y_valid into the validation_data argument.

3. With machine learning, and especially with deep learning, it is commonplace 
to train our model on the same data multiple times. One pass through all of our
training data (60,000 images in the current case) is called one epoch of training. By 
setting the epochs parameter to 200, we cycle through all 60,000 training images 
200 separate times. 

4. By setting verbose to 1, the model.fit() method will provide us with plenty of
feedback as we train. At the moment, we’ll focus on the val_acc statistic that is
output following each epoch of training. Validation accuracy is the proportion of the
10,000 handwritten images in X_valid in which the network’s highest probability
in the output layer corresponds to the correct digit as per the labels in y_valid.
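
The definition of validation accuracy in the last item can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `argmax` and `accuracy` are hypothetical, and the toy data uses three classes with made-up probabilities rather than real network output.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(predicted_probs, one_hot_labels):
    """Proportion of rows whose highest-probability class matches the label."""
    correct = sum(
        argmax(probs) == argmax(label)
        for probs, label in zip(predicted_probs, one_hot_labels)
    )
    return correct / len(one_hot_labels)

# Toy data with three classes instead of ten (made-up numbers).
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.5, 0.4, 0.1], [0.2, 0.5, 0.3]]
labels = [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 0]]
print(accuracy(probs, labels))
```

Three of the four toy predictions put their highest probability on the correct class, so the function returns 0.75.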

Following the first epoch of training, we observe that val_acc equals 0.1010. 14,15 
That is, 10.1 percent of the images from the held-out validation dataset were correctly 
classified by our shallow architecture. Given that there are 10 classes of handwritten digits, 
we’d expect a random process to guess 10 percent of the digits correctly by chance, so this
is not an impressive result. As the network continues to train, however, the results
improve. After 10 epochs of training, it is correctly classifying 36.5 percent of the validation
images—far better than would be expected by chance! And this is only the beginning:
After 200 epochs, the network’s improvement appears to be plateauing as it approaches
86 percent validation accuracy. Because we constructed an uninvolved, shallow
neural-network architecture, this is not too shabby!

Summary 

Putting the cart before the horse, in this chapter we coded up a shallow, elementary 
artificial neural network. With decent accuracy, it is able to classify the MNIST images. 
Over the remainder of Part II, as we dive into theory, unearth artificial-neural-network 
best practices, and layer up to authentic deep learning architectures, we should surely be 
able to classify inputs much more accurately, no? Let’s see. 


14. Artificial neural networks are stochastic (because of the way they’re initialized as well as the way they learn), so
your results will vary slightly from ours. Indeed, if you rerun the whole notebook (e.g., by clicking on the Kernel
option in the Jupyter menu bar and selecting Restart & Run All), you should obtain new, slightly different results
each time you do this.

15. By the end of Chapter 8, you’ll have enough theory under your belt to study the model.fit() output in all its
glory. For our immediate “cart before the horse” purposes, coverage of the validation accuracy metric alone suffices. 
Artificial Neurons Detecting Hot Dogs


Having received tantalizing exposure to applications of deep learning in the first part of
this book and having coded up a functioning neural network in Chapter 5, the moment 
has come to delve into the nitty-gritty theory underlying these capabilities. We begin by 
dissecting artificial neurons, the units that—when wired together—constitute an artificial 
neural network. 

Biological Neuroanatomy 101 

As presented in the opening paragraphs of this book, ersatz neurons are inspired by
biological ones. Given that, let’s take a gander at Figure 6.1 for a précis of the first lecture
in any neuroanatomy course: A given biological neuron receives input into its cell body
from many (generally thousands) of dendrites, with each dendrite receiving signals of
information from another neuron in the nervous system—a biological neural network. When
the signal conveyed along a dendrite reaches the cell body, it causes a small change in the
voltage of the cell body. 1 Some dendrites cause a small positive change in voltage, and the
others cause a small negative change. If the cumulative effect of these changes causes the
voltage to increase from its resting state of -70 millivolts to the critical threshold of -55
millivolts, the neuron will fire something called an action potential away from its cell body,
down its axon, thereby transmitting a signal to other neurons in the network.

To summarize, biological neurons exhibit the following three behaviors in sequence: 

1. Receive information from many other neurons 

2. Aggregate this information via changes in cell voltage at the cell body 

3. Transmit a signal if the cell voltage crosses a threshold level, a signal that can be
received by many other neurons in the network 


1. More precisely, it causes a change in the voltage difference between the cell’s interior and its surroundings. 
We’ve aligned the purple, red, and blue colors of the text here with the colors
(indicating dendrites, cell body, and the axon, respectively) in Figure 6.1. We’ll do
this time and again throughout the book, including to discuss key equations and the 
variables they contain. 


The Perceptron 

In the late 1950s, the American neurobiologist Frank Rosenblatt (Figure 6.2) published 
an article on his perceptron, an algorithm influenced by his understanding of biological 
neurons, making it the earliest formulation of an artificial neuron. 2 Analogous to its living
inspiration, the perceptron (Figure 6.3) can: 

1. Receive input from multiple other neurons 

2. Aggregate those inputs via a simple arithmetic operation called the weighted sum 

3. Generate an output if this weighted sum crosses a threshold level, which can then 
be sent on to many other neurons within a network 

The Hot Dog / Not Hot Dog Detector 

Let’s work through a lighthearted example to understand how the perceptron algorithm 
works. We’re going to look at a perceptron that is specialized in distinguishing whether a 
given object is a hot dog or, well . . . not a hot dog. 

A critical attribute of perceptrons is that they can only be fed binary information as 
inputs, and their output is also restricted to being binary. Thus, our hot dog-detecting 


2. Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and the organization in 
the brain. Psychological Review, 65, 386–408.
Figure 6.2 The American neurobiology and behavior researcher Frank Rosenblatt. He
conducted much of his work out of the Cornell Aeronautical Laboratory, including
physically constructing his Mark I Perceptron there. This machine, an early relic of
artificial intelligence, can today be viewed at the Smithsonian Institution in
Washington, D.C.



Figure 6.3 Schematic diagram of a perceptron, an early artificial neuron. Note the 
structural similarity to the biological neuron in Figure 6.1. 


perceptron must be fed its particular three inputs (indicating whether the object involves 
ketchup, mustard, or a bun, respectively) as either a 0 or a 1. In Figure 6.4: 

■ The first input (a purple 1) indicates the object being presented to the perceptron 
involves ketchup. 

■ The second input (also a purple 1) indicates the object has mustard. 

■ The third input (a purple 0) indicates the object does not include a bun. 

To make a prediction as to whether the object is a hot dog or not, the perceptron
independently weights each of these three inputs. 3 The weights that we arbitrarily selected


3. If you are well accustomed to regression modeling, this should be a familiar paradigm.
Figure 6.4 First example of a hot dog-detecting perceptron: In this instance, it predicts
there is indeed a hot dog.


in this (entirely contrived) hot dog example indicate that the presence of a bun, with
its weight of 6, is the most influential predictor of whether the object is a hot dog or
not. The intermediate-weight predictor is ketchup with its weight of 3, and the least
influential predictor is mustard, with a weight of 2.

Let’s determine the weighted sum of the inputs: One input at a time (i.e., elementwise),
we multiply the input by its weight and then sum the individual results. So first, let’s 
calculate the weighted inputs: 

1. For the ketchup input: 3 × 1 = 3

2. For mustard: 2 × 1 = 2

3. For bun: 6 × 0 = 0

With those three products, we can compute that the weighted sum of the inputs is 5: 3 + 
2 + 0. To generalize from this example, the calculation of the weighted sum of inputs is: 


$$\sum_{i=1}^{n} w_i x_i \tag{6.1}$$

Where: 

■ $w_i$ is the weight of a given input $i$ (in our example, $w_1 = 3$, $w_2 = 2$, and $w_3 = 6$).

■ $x_i$ is the value of a given input $i$ (in our example, $x_1 = 1$, $x_2 = 1$, and $x_3 = 0$).

■ $w_i x_i$ represents the product of $w_i$ and $x_i$, i.e., the weighted value of a given
input $i$.

■ $\sum_{i=1}^{n}$ indicates that we sum all of the individual weighted inputs $w_i x_i$, where $n$
is the total number of inputs (in our example, we had three inputs, but artificial
neurons can have any number of inputs).
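
The weighted sum of Equation 6.1 can be sketched in plain Python as follows. The function name `weighted_sum` is hypothetical; the weights and inputs are the ones from the Figure 6.4 example.

```python
def weighted_sum(weights, inputs):
    """Sum of each input multiplied elementwise by its weight (Equation 6.1)."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * x for w, x in zip(weights, inputs))

weights = [3, 2, 6]   # ketchup, mustard, bun
inputs = [1, 1, 0]    # ketchup and mustard present, no bun
print(weighted_sum(weights, inputs))
```

Because the function loops over however many weight and input pairs it is given, it works unchanged for any number of inputs n.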
The final step of the perceptron algorithm is to evaluate whether the weighted sum of
the inputs is greater than the neuron’s threshold. As with the earlier weights, we have again
arbitrarily chosen a threshold value for our perceptron example: 4 (shown in red in the
center of the neuron in Figure 6.4). The perceptron algorithm is:


$$\sum_{i=1}^{n} w_i x_i \begin{cases} > \text{threshold}, & \text{output } 1 \\ \leq \text{threshold}, & \text{output } 0 \end{cases} \tag{6.2}$$


Where: 

■ If the weighted sum of a perceptron’s inputs is greater than its threshold, then it
outputs a 1, indicating that the perceptron predicts the object is a hot dog.

■ If the weighted sum is less than or equal to the threshold, the perceptron outputs a 
0, indicating that it predicts there is not a hot dog. 

Knowing this, we can wrap up our example from Figure 6.4: The weighted sum of 5 
is greater than the neuron’s threshold of 4, and so our hot dog-detecting perceptron
outputs a 1.

Riffing on our first hot dog example, in Figure 6.5 the object evaluated by the
perceptron now includes mustard only; there is no ketchup, and it is still without a bun. In this
case the weighted sum of inputs comes out to 2. Because 2 is less than the perceptron’s
threshold, the neuron outputs 0, indicating that it predicts this object is not a hot dog.

In our third and final perceptron example, shown in Figure 6.6, the artificial neuron 
evaluates an object that involves neither mustard nor ketchup but is on a bun. The
presence of a bun alone corresponds to the calculation of a weighted sum of 6. Because 6 is
greater than the perceptron’s threshold, the algorithm predicts that the object is a hot dog
and outputs a 1.
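
The three hot dog examples can be checked with a short sketch of the threshold rule in Equation 6.2. The function name `perceptron` and the dictionary of examples are hypothetical scaffolding; the weights, threshold, and inputs are those used in Figures 6.4 through 6.6.

```python
def perceptron(inputs, weights, threshold):
    """Output 1 if the weighted sum exceeds the threshold, otherwise 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

WEIGHTS = [3, 2, 6]   # ketchup, mustard, bun
THRESHOLD = 4

examples = {
    "ketchup and mustard, no bun": [1, 1, 0],   # Figure 6.4
    "mustard only": [0, 1, 0],                  # Figure 6.5
    "bun only": [0, 0, 1],                      # Figure 6.6
}
for description, x in examples.items():
    print(description, "->", perceptron(x, WEIGHTS, THRESHOLD))
```

Note the strict inequality: a weighted sum exactly equal to the threshold yields 0, in line with the second case of Equation 6.2.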


Figure 6.5 Second example of a hot dog-detecting perceptron: In this instance, it
predicts there is not a hot dog.

Figure 6.6 Third example of a hot dog-detecting perceptron: In this instance, it again
predicts the object presented to it is a hot dog.


The Most Important Equation in This Book 

To achieve the formulation of a simplified and universal perceptron equation, we must 
introduce a term called the bias, which we annotate as b and which is equivalent to the
negative of an artificial neuron’s threshold value:


$$b = -\text{threshold} \tag{6.3}$$


Together, a neuron’s bias and its weights constitute all of its parameters: the changeable 
variables that prescribe what the neuron will output in response to its inputs. 

With the concept of a neuron’s bias now available to us, we arrive at the most widely
used perceptron equation: 


$$\text{output} = \begin{cases} 1 & \text{if } w \cdot x + b > 0 \\ 0 & \text{otherwise} \end{cases} \tag{6.4}$$


Notice that we made the following five updates to our initial perceptron equation (from
Equation 6.2):

1. Substituted the bias b in place of the neuron’s threshold

2. Flipped b onto the same side of the equation as all of the other variables

3. Used the array w to represent all of the $w_i$ weights from $w_1$ through to $w_n$

4. Likewise, used the array x to represent all of the $x_i$ values from $x_1$ through to $x_n$

5. Used the dot product notation w · x to abbreviate the representation of the
weighted sum of neuron inputs (the longer form of this is shown in Equation 6.1:
$\sum_{i=1}^{n} w_i x_i$)
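
As a minimal sketch of the bias form in Equation 6.4, the code below rewrites the hot dog perceptron with b set to the negative of its threshold, per Equation 6.3. The helper names `dot` and `perceptron_with_bias` are hypothetical.

```python
def dot(w, x):
    """Dot product of two equal-length sequences."""
    return sum(wi * xi for wi, xi in zip(w, x))

def perceptron_with_bias(x, w, b):
    """Equation 6.4: output 1 if w . x + b > 0, otherwise 0."""
    return 1 if dot(w, x) + b > 0 else 0

w = [3, 2, 6]    # ketchup, mustard, bun
threshold = 4
b = -threshold   # Equation 6.3

for x in ([1, 1, 0], [0, 1, 0], [0, 0, 1]):
    print(x, perceptron_with_bias(x, w, b))
```

Comparing w · x + b against zero gives the same outputs as comparing the weighted sum against the threshold, which is why the two formulations are interchangeable.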







Figure 6.7 The general equation for artificial neurons that we will return to time and 
again. It is the most important equation in this book.


Right at the heart of the perceptron equation in Equation 6.4 is w · x + b, which we
have cut out for emphasis and placed alone in Figure 6.7. If there is one item you note down
to remember from this chapter, it should be this three-variable formula, which is an equation that
represents artificial neurons in general. We refer to this equation many times over the course
of this book.


To keep the arithmetic as undemanding as possible in our hot dog-detecting perceptron
examples, all of the values we made up—the perceptron’s weights as well as
its threshold—were positive integers (which makes the bias, b = -4, a negative integer).
Parameters can be negative as well as positive, and, in practice, they would rarely be integers.
Instead, parameters are configured as float values, which are less clunky.

Finally, while all of the parameters in these examples were fabricated by us, they 
would usually be learned through the training of artificial neurons on data. In
Chapter 8, we cover how this training of neuron parameters is accomplished in practice.


Modern Neurons and Activation Functions 


Modern artificial neurons—such as those in the hidden layer of the shallow
architecture we built in Chapter 5 (look back to Figure 5.4 or to our Shallow Net in Keras
notebook)—are not perceptrons. While the perceptron provides a relatively
uncomplicated introduction to artificial neurons, it is not used widely today. The most obvious
restriction of the perceptron is that it receives only binary inputs, and provides only a
binary output. In many cases, we’d like to make predictions from inputs that are
continuous variables and not binary integers, and so this restriction alone would make
perceptrons unsuitable.

A less obvious (yet even more critical) corollary of the perceptron’s binary-only
restriction is that it makes learning rather challenging. Consider Figure 6.8, in which we use a
new term, z, as shorthand for the value of the lauded w · x + b equation from Figure 6.7.
Figure 6.8 The perceptron’s transition from outputting zero to outputting one happens
suddenly, making it challenging to gently tune w and b to match a desired output. 


When z is any value less than or equal to zero, the perceptron outputs its smallest possible 
output, 0. If z becomes positive to even the tiniest extent, the perceptron outputs its 
largest possible output, 1. This sudden and extreme transition is not optimal during
training: When we train a network, we make slight adjustments to w and b based on whether
it appears the adjustment will improve the network’s output. 4 With the perceptron, the
majority of slight adjustments to w and b would make no difference whatsoever to its 
output; z would generally be moving around at negative values much lower than 0 or 
at positive values much higher than 0. That behavior on its own would be unhelpful, 
but the situation is even worse: Every once in a while, a slight adjustment to w or b will 
cause z to cross from negative to positive (or vice versa), leading to a whopping, drastic 
swing in output from 0 all the way to 1 (or vice versa). Essentially, the perceptron has no 
finesse—it’s either yelling or it’s silent.
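
The all-or-nothing behavior described above can be seen in a tiny sketch of the perceptron's step function applied to z. The function name `perceptron_activation` and the specific z values are hypothetical, chosen only to illustrate the point.

```python
def perceptron_activation(z):
    """Step function: 0 for z <= 0, 1 for any positive z."""
    return 1 if z > 0 else 0

# Nudging z while it sits far from zero changes nothing ...
print(perceptron_activation(-5.0), perceptron_activation(-4.9))
# ... whereas a tiny nudge across zero flips the output completely.
print(perceptron_activation(-0.001), perceptron_activation(0.001))
```

In the first pair the output gives no hint about whether the adjustment helped; in the second, a change of 0.002 in z swings the output across its entire range.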

The Sigmoid Neuron 

Figure 6.9 provides an alternative to the erratic behavior of the perceptron: a gentle curve 
from 0 to 1. This particular curve shape is called the sigmoid function and is defined by
$\sigma(z) = \frac{1}{1 + e^{-z}}$, where:

■ z is equivalent to w · x + b.

■ e is the mathematical constant beginning with 2.718. It is perhaps best known for its
starring role in the natural exponential function.

■ σ is the Greek letter sigma, the root word for “sigmoid.”

The sigmoid function is our first example of an artificial neuron activation function. It 
may be ringing a bell for you already, because it was the neuron type that we selected 
for the hidden layer of our Shallow Net in Keras from Chapter 5. As you’ll see as this
section progresses, the sigmoid function is the canonical activation function—so much so
that the Greek letter σ (sigma) is conventionally used to denote any activation function.


4. Improve here means providing output more closely in line with the true output y given some input x. We 
discuss this further in Chapter 8. 
Figure 6.9 The sigmoid activation function


The output from any given neuron’s activation function is referred to simply as its
activation, and throughout this book, we use the variable term a—as shown along the vertical
axis in Figure 6.9—to denote it.

In our view, there is no need to memorize the sigmoid function (or indeed any of 
the activation functions). Instead, we believe it’s easier to understand a given function by 
playing around with its behavior interactively. With that in mind, feel free to join us in 
the Sigmoid Function Jupyter notebook from the book’s GitHub repository as we work
through the following lines of code. 

Our only dependency in the notebook is the constant e, which we load using the 
statement from math import e. Next is the fun bit, where we define the sigmoid
function itself:

from math import e

def sigmoid(z):
    return 1 / (1 + e**-z)

As depicted in Figure 6.9 and demonstrated by executing sigmoid(.00001), near-0
inputs into the sigmoid function will lead it to return values near 0.5. Increasingly large
positive inputs will result in values that approach 1. As an extreme example, an input
of 10000 results in an output of 1.0. Moving more gradually with our inputs—this
time in the negative direction—we obtain outputs that gently approach 0: As examples,
sigmoid(-1) returns 0.2689, while sigmoid(-10) returns 4.5398e-05. 5

Any artificial neuron that features the sigmoid function as its activation function is 
called a sigmoid neuron, and the advantage of these over the perceptron should now be
tangible: Small, gradual changes in a given sigmoid neuron’s parameters w or b cause small,
gradual changes in z, thereby producing similarly gradual changes in the neuron’s
activation, a. Large negative or large positive values of z illustrate an exception: At extreme
z values, sigmoid neurons—like perceptrons—will output 0s (when z is negative) or 1s
(when z is positive). As with the perceptron, this means that subtle updates to the weights


5. The e in 4.5398e-05 should not be confused with the base of the natural logarithm. Used in code outputs, it 
refers to an exponent, so the output is the equivalent of $4.5398 \times 10^{-5}$.
Figure 6.10 The tanh activation function


and biases during training will have little to no effect on the output, and thus learning
will stall. This situation is called neuron saturation and can occur with most activation
functions. Thankfully, there are tricks to avoid saturation, as you’ll see in Chapter 9.
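
Saturation can be made concrete with a small sketch that measures how far the sigmoid activation moves when z is nudged by the same amount at different starting points. The helper `output_change` and the nudge size of 0.1 are hypothetical choices for illustration.

```python
from math import e

def sigmoid(z):
    return 1 / (1 + e**-z)

def output_change(z, nudge=0.1):
    """How much the sigmoid activation moves when z is nudged upward."""
    return sigmoid(z + nudge) - sigmoid(z)

for z in (0, 2, 10):
    print(z, output_change(z))
```

The change is largest near z = 0 and shrinks toward nothing as z grows, which is the sense in which a saturated neuron stops responding to small parameter updates.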

The Tanh Neuron 

A popular cousin of the sigmoid neuron is the tanh (pronounced “tanch” in the deep
learning community) neuron. The tanh activation function is pictured in Figure 6.10 and
is defined by $\sigma(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$. The shape of the tanh curve is similar to the sigmoid
curve, with the chief distinction being that the sigmoid function exists in the range [0 : 1],
whereas the tanh neuron’s output has the range [-1 : 1]. This difference is more than
cosmetic. With negative z inputs corresponding to negative a activations, z = 0
corresponding to a = 0, and positive z corresponding to positive a activations, the output from tanh
neurons tends to be centered near 0. As we cover further in Chapters 7 through 9, these
0-centered a outputs usually serve as the inputs x to other artificial neurons in a network,
and such 0-centered inputs make (the dreaded!) neuron saturation less likely, thereby
enabling the entire network to learn more efficiently.
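
In the same spirit as the sigmoid code, here is a minimal sketch of tanh written out from its definition and compared against Python's built-in `math.tanh`. The hand-written version is for illustration only: it overflows for very large values of z, so the built-in is the safer choice in real code.

```python
from math import e, tanh as math_tanh

def tanh(z):
    """tanh written out from its definition in terms of e."""
    return (e**z - e**-z) / (e**z + e**-z)

for z in (-2, 0, 2):
    print(z, tanh(z), math_tanh(z))
```

Negative inputs give negative activations, zero gives zero, and positive inputs give positive activations, which is the 0-centered behavior that distinguishes tanh from the sigmoid.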

ReLU: Rectified Linear Units 

The final neuron we detail in this book is the rectified linear unit, or ReLU neuron, whose
behavior we graph in Figure 6.11. The ReLU activation function, whose shape diverges
glaringly from the sigmoid and tanh sorts, was inspired by properties of biological
neurons 6 and popularized within artificial neural networks by Vinod Nair and Geoff Hinton
(Figure 1.16). 7 The shape of the ReLU function is defined by a = max(0, z).


6. The action potentials of biological neurons have only a “positive” firing mode; they have no “negative”
firing mode. See Hahnloser, R., & Seung, H. (2000). Permitted and forbidden sets in symmetric threshold-linear
networks. Advances in Neural Information Processing Systems, 13. 

7. Nair, V., & Hinton, G. (2010). Rectified linear units improve restricted Boltzmann machines. Proceedings of the 
International Conference on Machine Learning. 











This function is uncomplicated: 

■ If z is a positive value, the ReLU activation function returns z (unadulterated) as 
a = z. 

■ If z = 0 or z is negative, the function returns its floor value of 0, that is, the
activation a = 0.
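
The two cases above translate directly into code. The sketch below uses Python's built-in `max`; the function name `relu` is our own choice for this illustration.

```python
def relu(z):
    """Rectified linear unit: z for positive z, 0 otherwise."""
    return max(0, z)

for z in (-3.5, 0, 2.5):
    print(z, relu(z))
```

Negative inputs and zero are floored at 0, while positive inputs pass through unchanged.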

The ReLU function is one of the simplest functions to imagine that is nonlinear. That
is, like the sigmoid and tanh functions, its output a does not vary uniformly linearly
across all values of z. The ReLU is in essence two distinct linear functions combined
(one at negative z values returning 0, and the other at positive z values returning z, as
is visible in Figure 6.11) to form a straightforward, nonlinear function overall. This
nonlinear nature is a critical property of all activation functions used within deep learning
architectures. As demonstrated via a series of captivating interactive applets in
Chapter 4 of Michael Nielsen’s Neural Networks and Deep Learning e-book, these nonlinearities
permit deep learning models to approximate any continuous function. 8 This universal
ability to approximate some output y given some input x is one of the hallmarks of deep
learning—the characteristic that makes the approach so effective across such a breadth of
applications.

The relatively simple shape of the ReLU function’s particular brand of nonlinearity
works to its advantage. As you’ll see in Chapter 8, learning appropriate values for w and
b within deep learning networks involves partial derivative calculus, and these calculus
operations are more computationally efficient on the linear portions of the ReLU
function relative to its efficiency on the curves of, say, the sigmoid and tanh functions. 9 As a
testament to its utility, the incorporation of ReLU neurons into AlexNet (Figure 1.17)
was one of the factors behind its trampling of existing machine vision benchmarks in 
2012 and shepherding in the era of deep learning. Today, ReLU units are the most 
widely used neuron within the hidden layers of deep artificial neural networks, and they 
appear in the majority of the Jupyter notebooks associated with this book. 
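
To hint at why the calculus is cheaper for ReLU, the sketch below writes out the slope (derivative with respect to z) of the sigmoid and of ReLU. These are standard calculus results rather than code from the book's notebooks; the function names are hypothetical, and treating the ReLU slope at exactly z = 0 as 0 is a convention we assume here.

```python
from math import e

def sigmoid(z):
    return 1 / (1 + e**-z)

def sigmoid_slope(z):
    """Derivative of the sigmoid: requires evaluating an exponential."""
    a = sigmoid(z)
    return a * (1 - a)

def relu_slope(z):
    """Derivative of ReLU: a comparison only (slope at z = 0 taken to be 0)."""
    return 1.0 if z > 0 else 0.0

for z in (-2.0, 0.5, 3.0):
    print(z, sigmoid_slope(z), relu_slope(z))
```

The ReLU slope is always exactly 0 or 1 and needs no exponentiation, whereas the sigmoid slope must be computed from the curve at every point.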


8. neuralnetworksanddeeplearning.com/chap4.html 

9. In addition, there is mounting research that suggests ReLU activations encourage parameter sparsity—that is,
less-elaborate neural-network-level functions that tend to generalize to validation data better. More on model 
generalization coming up in Chapter 9. 
Choosing a Neuron

Within a given hidden layer of an artificial neural network, you are able to choose any 
activation function you fancy. With the constraint that you should select a nonlinear 
function if you’d like to be able to approximate any continuous function with your deep 
learning model, you’re nevertheless left with quite a bit of room for choice. To assist
your decision-making process, let’s rank the neuron types we’ve discussed in this chapter,
ordering them from those we recommend least through to those we recommend most: 

1. The perceptron, with its binary inputs and the aggressive step of its binary output, is
not a practical consideration for deep learning models. 

2. The sigmoid neuron is an acceptable option, but it tends to lead to neural networks 
that train less rapidly than those composed of, say, tanh or ReLU neurons. Thus, 
we recommend limiting your use of sigmoid neurons to situations where it would 
be helpful to have a neuron provide output within the range of [0, 1]. 10

3. The tanh neuron is a solid choice. As we covered earlier, the 0-centered output 
helps deep learning networks learn rapidly. 

4. Our preferred neuron is the ReLU because of how efficiently these neurons enable 
learning algorithms to perform computations. In our experience they tend to lead 
to well-calibrated artificial neural networks in the shortest period of training time. 

In addition to the neurons covered in this chapter, there is a veritable zoo of
activation functions available and the list is ever growing. At time of writing, some of the
“advanced” activation functions provided by Keras 11 are the leaky ReLU, the parametric
ReLU, and the exponential linear unit—all three of which are derivations from the ReLU
neuron. We encourage you to check these activations out in the Keras documentation
and read about them on your own time. Furthermore, you are welcome to swap out the
neurons we use in any of the Jupyter notebooks in this book to compare the results. We’d
be pleasantly surprised if you discover that they provide efficiency or accuracy gains in 
your neural networks that are far beyond the performance of ours. 
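
For a rough sense of how two of these ReLU variants differ from the original, here is a sketch based on their commonly cited definitions. It is not taken from the Keras source, and the `alpha` values are placeholders chosen for illustration rather than the defaults of any particular library. (The parametric ReLU has the same form as the leaky ReLU, except that its slope for negative z is learned during training.)

```python
from math import exp

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but negative z is scaled by a small slope alpha instead of zeroed."""
    return z if z > 0 else alpha * z

def elu(z, alpha=1.0):
    """Exponential linear unit: negative z curves smoothly toward -alpha."""
    return z if z > 0 else alpha * (exp(z) - 1)

for z in (-5.0, -1.0, 0.0, 2.0):
    print(z, leaky_relu(z), elu(z))
```

Both functions match ReLU for positive z and differ only in letting a small negative activation through when z is negative.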

Summary 

In this chapter, we detailed the mathematics behind the neural units that make up 
artificial neural networks, including deep learning models. We also summarized the 
pros and cons of the most established neuron types, providing you with guidance on 
which ones you might select for your own deep learning models. In Chapter 7, we cover 
how artificial neurons are networked together in order to learn features from raw data and 
approximate complex functions. 


10. In Chapters 7 and 11, you will encounter a couple of these situations—most notably, with a sigmoid neuron 
as the sole neuron in the output layer of a binary-classifier network. 

11. See keras.io/layers/advanced-activations for documentation. 
Key Concepts

As we move through the chapters of the book, we will gradually add terms to this list of
key concepts. If you keep these foundational concepts fresh in your mind, you should
have little difficulty understanding subsequent chapters and, by book’s end, possess a
firm grip on deep learning theory and application. The critical concepts thus far are 
as follows. 

■ parameters: 

■ weight w

■ bias b 

■ activation a 

■ artificial neurons: 

■ sigmoid 

■ tanh 

■ ReLU 
Artificial Neural Networks


In Chapter 6, we examined the intricacies of artificial neurons. The theme of the current
chapter is the natural extension of that: We cover how individual neural units are linked
together to form artificial neural networks, including deep learning networks.

The Input Layer 

In our Shallow Net in Keras Jupyter notebook (a schematic of which is available in 
Figure 5.4), we crafted an artificial neural network with the following layers: 

1. An input layer consisting of 784 neurons, one for each of the 784 pixels in an 
MNIST image 

2. A hidden layer composed of 64 sigmoid neurons 

3. An output layer consisting of 10 softmax neurons, one for each of the 10 classes of 
digits 

Of these three, the input layer is the most straightforward to detail. We start with it and 
then move on to discussion of the hidden and output layers. 

Neurons in the input layer don’t perform any calculations; they are simply placeholders
for input data. This placeholding is essential because the use of artificial neural networks
involves performing computations on matrices that have predefined dimensions. At least
one of these predefined dimensions in the network architecture corresponds directly to
the shape of the input data. 

Dense Layers 

There are many kinds of hidden layers, but as mentioned in Chapter 4, the most
general type is the dense layer, which can also be called a fully connected layer. Dense layers are
found in many deep learning architectures, including the majority of the models we go
over in this book. Their definition is uncomplicated: Each of the neurons in a given dense
layer receives information from every one of the neurons in the preceding layer of the
network. In other words, a dense layer is fully connected to the layer before it!
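
The definition of a dense layer can be sketched in plain Python as below. This is an illustration under stated assumptions, not how Keras implements it: the function name `dense_layer` is hypothetical, sigmoid is used as the activation purely as an example, and all of the weights, biases, and inputs are made-up values.

```python
from math import e

def sigmoid(z):
    return 1 / (1 + e**-z)

def dense_layer(inputs, weights, biases):
    """Each neuron weights *every* input, adds its bias, and applies sigmoid.

    weights[j] holds the weights of neuron j, one per input.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Two inputs feeding three neurons (all values made up for illustration).
inputs = [0.5, -1.0]
weights = [[0.1, 0.4], [-0.3, 0.2], [0.0, 0.0]]
biases = [0.0, 0.1, 0.0]
activations = dense_layer(inputs, weights, biases)
print(activations)
```

Every row of `weights` has one entry per input, which is the code-level meaning of "fully connected": no neuron skips any output of the preceding layer.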

While they might not be as specialized nor as efficient as the other flavors of hidden 
layers we dig into in Part III, dense layers are broadly useful, because they can nonlinearly 
recombine the information provided by the preceding layer of the network. 1 Reviewing
the TensorFlow Playground demo from the end of Chapter 1, we’re now better
positioned to appreciate the deep learning model we built. Breaking it down layer by layer,
the network in Figures 1.18 and 1.19 has the following layers: 

1. An input layer with two neurons: one for storing the vertical position of a given dot 
within the grid on the far right, and the other for storing the dot’s horizontal
position.

2. A hidden layer composed of eight ReLU neurons. Visually, we can see that this is a dense
layer because each of the eight neurons in it is connected to (i.e., is receiving information
from) both of the input-layer neurons, for a total of 16 (= 8 × 2) incoming
connections.

3. Another hidden layer composed of eight ReLU neurons. We can again discern that this is 
a dense layer because each of its eight neurons receives input from each of the eight 
neurons in the preceding layer, for a total of 64 (= 8 × 8) inbound connections.
Note in Figure 1.19 how the neurons in this layer are nonlinearly recombining the 
straight-edge features provided by the neurons in the first hidden layer to produce 
more-elaborate features like curves and circles. 2 

4. A third dense hidden layer, this one consisting of four ReLU neurons for a total of 32
(= 4 × 8) connecting inputs. This layer nonlinearly recombines the features from
the previous hidden layer to learn more-complex features that begin to look directly 
relevant to the binary (orange versus blue) classification problem shown in the grid 
on the right in Figure 1.18. 

5. A fourth and final dense hidden layer. With its two ReLU neurons, it receives a total of
8 (= 2 × 4) inputs from the previous layer. The neurons in this layer devise such
elaborate features via nonlinear recombination that they visually approximate the 
overall boundary dividing blue from orange on the far-right grid. 

6. An output layer made up of a single sigmoid neuron. Sigmoid is the typical choice of
neuron for a binary classification problem like this one. As shown in Figure 6.9,
the sigmoid function outputs activations that range from 0 up to 1, allowing us to
obtain the network’s estimated probability that a given input x is a positive case (a
blue dot in the current example). Like the hidden layers, the output layer is dense,
too: Its neuron receives information from both neurons of the final hidden layer for
a total of 2 (= 1 × 2) connections.

In summary, every layer within the networks provided by the TensorFlow Playground is a 
dense layer. We can call such a network a dense network, and we’ll experiment with these 
versatile creatures for the remainder of Part II. 3 
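
The connection counts quoted in the list above follow from one rule: a dense layer has one incoming connection per neuron per input. As a minimal sketch (the function name dense_connections is our own, not something from TensorFlow Playground), the counts can be tallied like this:

```python
# Layer widths of the TensorFlow Playground network described above:
# 2 inputs, four hidden layers (8, 8, 4 and 2 neurons), 1 output neuron.
layer_sizes = [2, 8, 8, 4, 2, 1]


def dense_connections(sizes):
    """Return the number of incoming connections for each dense layer."""
    # Each neuron in a dense layer is connected to every neuron before it.
    return [n_out * n_in for n_in, n_out in zip(sizes[:-1], sizes[1:])]


connections = dense_connections(layer_sizes)
print(connections)       # [16, 64, 32, 8, 2]
print(sum(connections))  # 122
```

Note that these are connections (weights) only; each neuron also carries a bias, which this tally leaves out.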


1. This statement assumes that the dense layer is made up of neurons with a nonlinear activation function like the 
sigmoid, tanh, and ReLU neurons introduced in Chapter 6, which should be a safe assumption. 

2. By returning to playground.tensorflow.org you can observe these features closely by hovering over these 
neurons with your mouse. 

3. Elsewhere, you may find dense networks referred to as feedforward neural networks or multilayer perceptrons (MLPs). 
We prefer not to use the former term because other model architectures, such as convolutional neural networks 
(formally introduced in Chapter 10), are feedforward networks (that is, any network that doesn’t include a loop) 
as well. Meanwhile, we prefer not to use the latter term because MLPs, confusingly, don’t involve the perceptron 
neural units we cover in Chapter 6.

A Hot Dog-Detecting Dense Network


Let’s further strengthen your comprehension of dense networks by returning to two old
flames of ours from Chapter 6: a frivolous hot dog-detecting binary classifier and the 
mathematical notation we used to define artificial neurons. As shown in Figure 7.1, our 
hot dog classifier is no longer a single neuron; in this chapter, it is a dense network of 
artificial neurons. More specifically, with this network architecture, the following differ- 
ences apply: 

■ We have reduced the number of input neurons down to two for simplicity. 

■ The first input neuron, $x_1$, represents the volume of ketchup (in, say,
milliliters, which abbreviates to mL) on the object being considered by the
network. (We are no longer working with perceptrons, so we are no longer
restricted to binary inputs only.)

■ The second input neuron, $x_2$, represents milliliters of mustard.

■ We have two dense hidden layers. 

■ The first hidden layer has three ReLU neurons. 

■ The second hidden layer has two ReLU neurons. 

■ The output neuron is denoted by y in the network. This is a binary classification 
problem, so—as outlined in the previous section—this neuron should be sigmoid. 

As in our perceptron examples in Chapter 6, y = 1 corresponds to the presence of
a hot dog and y = 0 corresponds to the presence of some other object.

Figure 7.1 A dense network of artificial neurons, highlighting the inputs to the neuron labeled $a_1$

Forward Propagation Through the First Hidden Layer

Having described the architecture of our hot dog-detecting network, let’s turn our attention
to its functionality by focusing on the neuron labeled $a_1$. 4 This particular neuron,
like its siblings $a_2$ and $a_3$, receives input regarding a given object’s “ketchup-y-ness” and
“mustard-y-ness” from $x_1$ and $x_2$, respectively. Despite receiving the same data as $a_2$ and
$a_3$, $a_1$ treats these data uniquely by having its own unique parameters. Remembering
Figure 6.7, “the most important equation in this book” ($w \cdot x + b$), you may grasp this
behavior more concretely. Breaking this equation down for the neuron labeled $a_1$, we
consider that it has two inputs from the preceding layer: $x_1$ and $x_2$. This neuron also has
two weights: $w_1$ (which applies to the importance of the ketchup measurement $x_1$) and
$w_2$ (which applies to the importance of the mustard measurement $x_2$). With these five
pieces of information (two inputs, two weights, and the bias b) we can calculate z, the
weighted input to that neuron:

$$
\begin{aligned}
z &= w \cdot x + b \\
  &= (w_1 x_1 + w_2 x_2) + b
\end{aligned}
\tag{7.1}
$$

In turn, with the z value for the neuron labeled $a_1$, we can calculate the activation a it
outputs. Because the neuron labeled $a_1$ is a ReLU neuron, we use the equation introduced
with respect to Figure 6.11:


$$
a = \max(0, z) \tag{7.2}
$$

To make this computation of the output of neuron $a_1$ tangible, let’s concoct some
numbers and work through the arithmetic together:

■ $x_1$ is 4.0 mL of ketchup for a given object presented to the network

■ $x_2$ is 3.0 mL of mustard for that same object

■ $w_1 = -0.5$

■ $w_2 = 1.5$

■ $b = -0.9$

To calculate z, let’s start with Equation 7.1 and then fill in our contrived values:

$$
\begin{aligned}
z &= w \cdot x + b \\
  &= w_1 x_1 + w_2 x_2 + b \\
  &= -0.5 \times 4.0 + 1.5 \times 3.0 - 0.9 \\
  &= -2 + 4.5 - 0.9 \\
  &= 1.6
\end{aligned}
\tag{7.3}
$$


4. We’re using a shorthand notation for conveniently identifying neurons in this chapter. See Appendix A for a
more precise and formal notation used for neural networks.

Finally, to compute a (the activation output of the neuron labeled $a_1$), we can leverage
Equation 7.2:


$$
\begin{aligned}
a &= \max(0, z) \\
  &= \max(0, 1.6) \\
  &= 1.6
\end{aligned}
\tag{7.4}
$$
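
The arithmetic in Equations 7.3 and 7.4 can be reproduced in a few lines of Python. This is only a sketch: the helper names weighted_input and relu are our own choices for illustration, while the numbers are the contrived values listed above.

```python
def weighted_input(w, x, b):
    """Compute z = w . x + b for a single neuron (Equation 7.1)."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def relu(z):
    """ReLU activation: a = max(0, z) (Equation 7.2)."""
    return max(0.0, z)


x = [4.0, 3.0]    # mL of ketchup, mL of mustard
w = [-0.5, 1.5]   # contrived weights of the neuron labeled a_1
b = -0.9          # contrived bias of the neuron labeled a_1

z = weighted_input(w, x, b)
a = relu(z)
print(round(z, 4), round(a, 4))  # 1.6 1.6
```

The rounding is there only because floating-point addition can leave a tiny residue in the final digits.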

As suggested by the right-facing arrow along the bottom of Figure 7.1, executing
the calculations through an artificial neural network from the input layer (the x values)
through to the output layer (y) is called forward propagation. Just now, we detailed the
process for forward propagating through a single neuron in the first hidden layer of our
hot dog-detecting network. To forward propagate through the remaining neurons of the
first hidden layer (that is, to calculate the a values for the neurons labeled $a_2$ and $a_3$),
we would follow the same process as we did for the neuron labeled $a_1$. The inputs $x_1$
and $x_2$ are identical for all three neurons, but despite being fed the same measurements
of ketchup and mustard, each neuron in the first hidden layer will output a different
activation a because the parameters $w_1$, $w_2$, and b vary for each of the neurons in the
layer.
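
To see how the same two inputs yield three different activations, here is a small sketch of forward propagation through the whole first hidden layer. The parameters for $a_1$ are the chapter's contrived values; the parameters for $a_2$ and $a_3$ are made up purely for illustration, and the dense_layer helper is our own name rather than a library function.

```python
def relu(z):
    """ReLU activation: a = max(0, z)."""
    return max(0.0, z)


def dense_layer(x, weights, biases, activation):
    """Forward propagate the inputs x through one dense layer.

    weights holds one list of weights per neuron; biases holds one bias
    per neuron. Every neuron receives every input.
    """
    return [activation(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, biases)]


x = [4.0, 3.0]  # ketchup and mustard, in mL

# Row 0 holds the chapter's values for a_1. Rows 1 and 2 are hypothetical
# parameters for a_2 and a_3, invented only for this example.
weights = [[-0.5, 1.5],
           [0.8, -0.2],
           [-1.0, 0.4]]
biases = [-0.9, 0.1, 0.5]

a = dense_layer(x, weights, biases, relu)
print([round(v, 4) for v in a])  # [1.6, 2.7, 0.0]
```

The third neuron outputs 0.0 because its weighted input is negative and ReLU clips it.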

Forward Propagation Through Subsequent Layers 

The process of forward propagating through the remaining layers of the network is 
essentially the same as propagating through the first hidden layer, but for clarity’s sake,
let’s work through an example together. In Figure 7.2, we assume that we’ve already



Figure 7.2 Our hot dog-detecting network from Figure 7.1, now highlighting the
activation output of neuron $a_1$, which is provided as an input to both neuron $a_4$ and
neuron $a_5$

calculated the activation value a for each of the neurons in the first hidden layer. Returning
our focus to the neuron labeled $a_1$, the activation it outputs ($a_1 = 1.6$) becomes one
of the three inputs into the neuron labeled $a_4$ (and, as highlighted in the figure, this same
activation of a = 1.6 is also fed as one of the three inputs into the neuron labeled $a_5$).

To provide an example of forward propagation through the second hidden layer,
let’s compute a for the neuron labeled $a_4$. Again, we employ the all-important equation
$w \cdot x + b$. For brevity’s sake, we’ve combined it with the ReLU activation function:


$$
\begin{aligned}
a &= \max(0, z) \\
  &= \max(0, (w \cdot x + b)) \\
  &= \max(0, (w_1 x_1 + w_2 x_2 + w_3 x_3 + b))
\end{aligned}
\tag{7.5}
$$


This is sufficiently similar to Equations 7.3 and 7.4 that it would be superfluous to walk 
through the arithmetic again with feigned values. As we propagate through the second 
hidden layer, the only twist is that the layer’s inputs (i.e., x in the equation $w \cdot x + b$) do
not come from outside the network; instead they are provided by the first hidden layer. 
Thus, in Equation 7.5: 

■ $x_1$ is the value a = 1.6, which we obtained earlier from the neuron labeled $a_1$

■ $x_2$ is the activation output a (whatever it happens to equal) from the neuron
labeled $a_2$

■ $x_3$ is likewise a unique activation a from the neuron labeled $a_3$

In this manner, the neuron labeled $a_4$ is able to nonlinearly recombine the information
provided by the three neurons of the first hidden layer. The neuron labeled $a_5$ also
nonlinearly recombines this information, but it would do it in its own distinctive way: The
unique parameters $w_1$, $w_2$, $w_3$, and b for this neuron would lead it to output a unique a
activation of its own.

Having illustrated forward propagation through all of the hidden layers of our hot dog-
detecting network, let’s round the process off by propagating through the output layer.
Figure 7.3 highlights that our single output neuron receives its inputs from the neurons
labeled $a_4$ and $a_5$. Let’s begin by calculating z for this output neuron. The formula is
identical to Equation 7.1, which we used to calculate z for the neuron labeled $a_1$, except
that the (contrived, as usual) values we plug in to the variables are different:


$$
\begin{aligned}
z &= w \cdot x + b \\
  &= w_1 x_1 + w_2 x_2 + b \\
  &= 1.0 \times 2.5 + 0.5 \times 2.0 - 5.5 \\
  &= 3.5 - 5.5 \\
  &= -2.0
\end{aligned}
\tag{7.6}
$$

Figure 7.3 Our hot dog-detecting network, with the activations providing input to the
output neuron y highlighted


The output neuron is sigmoid, so to compute its activation a we pass its z value
through the sigmoid function from Figure 6.9:

$$
\begin{aligned}
a &= \sigma(z) \\
  &= \frac{1}{1 + e^{-z}} \\
  &= \frac{1}{1 + e^{-(-2.0)}} \\
  &\approx 0.1192
\end{aligned}
\tag{7.7}
$$

We are lazy, so we didn’t work out the final line of this equation manually. Instead, we
used the Sigmoid Function Jupyter notebook that we created in Chapter 6. By executing
the line sigmoid(-2.0) within it, our machine did the heavy lifting for us and kindly
informed us that a comes out to about 0.1192.

The activation a computed by the sigmoid neuron in the output layer is a special case,
because it is the final output of our entire hot dog-detecting neural network. Because it’s
so special, we assign it a distinctive designation: $\hat{y}$. This variable is a version of the letter y
that wears an object called a caret to keep its head warm, and so we call it “why hat.” The
value represented by $\hat{y}$ is the network’s guess as to whether a given object is a hot dog or
not a hot dog, and we can express this in probabilistic language. Given the inputs $x_1$ and
$x_2$ that we fed into the network (that is, 4.0 mL of ketchup and 3.0 mL of mustard), the
network estimates that there is an 11.92 percent chance that an object with those particular
condiment measurements is a hot dog. 5 If the object presented to the network was


5. Don’t say we didn’t warn you from the start that this was a silly example! If we’re lucky, its outlandishness will
make it memorable.

indeed a hot dog (y = 1), then this $\hat{y}$ of 0.1192 was pretty far off the mark. On the
other hand, if the object was truly not a hot dog (y = 0), then the $\hat{y}$ is quite good.

We formalize the evaluation of $\hat{y}$ predictions in Chapter 8, but the general notion is that
the closer $\hat{y}$ is to the true value y, the better.
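
The output-layer calculation in Equations 7.6 and 7.7 can be checked with a short sketch. The sigmoid function below is written out from its definition and stands in for the one in the Sigmoid Function notebook; the variable names are our own.

```python
from math import exp


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + exp(-z))


x = [2.5, 2.0]   # contrived activations from the neurons labeled a_4 and a_5
w = [1.0, 0.5]   # contrived weights of the output neuron
b = -5.5         # contrived bias of the output neuron

z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
y_hat = sigmoid(z)
print(z)                # -2.0
print(round(y_hat, 4))  # 0.1192
```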


The Softmax Layer of a Fast Food-Classifying 
Network 


As demonstrated thus far in the chapter, the sigmoid neuron suits us well as an output
neuron if we’re building a network to distinguish two classes, such as a blue dot versus an
orange dot, or a hot dog versus something other than a hot dog. In many other circumstances,
however, you have more than two classes to distinguish between. For example,
the MNIST dataset consists of the 10 numerical digits, so our Shallow Net in Keras from
Chapter 6 had to accommodate 10 output probabilities, one representing each digit.

When concerned with a multiclass problem, the solution is to use a softmax layer as 
the output layer of our network. Softmax is in fact the activation function that we spec- 
ified for the output layer in our Shallow Net in Keras Jupyter notebook (Example 5.6), 
but we initially suggested you not concern yourself with that detail. Now, a couple of 
chapters later, the time to unravel softmax has arrived. 

In Figure 7.4, we provide a new architecture that builds upon our binary hot dog 
classifier. The schematic is the same—right down to its volumes-of-ketchup-and-mustard 
inputs—except that instead of having a single output neuron, we now have three. This 
multiclass output layer is still dense, so each of the three neurons receives information
from both of the neurons in the final hidden layer. Continuing on with our proclivity for
fast food, let’s say that now:

■ $\hat{y}_1$ represents hot dogs.

■ $\hat{y}_2$ is for burgers.

■ $\hat{y}_3$ is for pizza.

Note that with this configuration, there can be no alternatives to hot dogs, burgers, or 
pizza. The assumption is that all objects presented to the network belong to one of these 
three classes of fast food, and one of the classes only. 

Because the sigmoid function applies solely to binary problems, the output neurons
in Figure 7.4 take advantage of the softmax activation function. Let’s use code from our
Softmax Demo Jupyter notebook to elucidate how this activation function operates. The
only dependency is the exp function, which calculates the natural exponential of whatever
value it’s given. More specifically, if we pass some value x into it with the command
exp(x), we will get back $e^x$. The effect of this exponentiation will become clear as we
move through the forthcoming example. We import the exp function into the notebook
by using from math import exp.

To concoct a particular example, let’s say that we presented a slice of pizza to the network
in Figure 7.4. This pizza slice has negligible amounts of ketchup and mustard on it,
and so $x_1$ and $x_2$ are near-0 values. Provided these inputs, we use forward propagation to
pass information through the network toward the output layer. Based on the information





Figure 7.4 Our food-detecting network, now with three softmax neurons in the output layer


that the three neurons receive from the final hidden layer, they individually use our old
friend $w \cdot x + b$ to calculate three unique (and, for the purposes of this example, contrived)
z values:

■ z for the neuron labeled $\hat{y}_1$, which represents hot dogs, comes out to -1.0.

■ For the neuron labeled $\hat{y}_2$, which represents burgers, z is 1.0.

■ For the pizza neuron $\hat{y}_3$, z comes out to 5.0.

These values indicate that the network estimates that the object presented to it is most 
likely to be pizza and least likely to be a hot dog. Expressed as z, however, it isn’t straight- 
forward to intuit how much more likely the network predicts the object to be pizza relative 
to the other two classes. This is where the softmax function comes in. 

After importing our dependency, we create a list named z to store our three z values: 

z = [-1.0, 1.0, 5.0]

Applying the softmax function to this list involves a three-step process. The first step is to
calculate the exponential of each of the z values. More explicitly:

■ exp(z[0]) comes out to 0.3679 for hot dog. 6

■ exp(z[1]) gives us 2.718 for burger.

■ exp(z[2]) gives us the much, much larger (exponentially so!) 148.4 for pizza.


6. Recall that Python uses zero indexing, so z[0] corresponds to the z of neuron $\hat{y}_1$.

The second step of the softmax function is to sum up our exponentials:
total = exp(z[0]) + exp(z[1]) + exp(z[2]) 

With this total variable we can execute the third and final step, which provides proportions
for each of our three classes relative to the sum of all of the classes:

■ exp(z[0])/total outputs a $\hat{y}_1$ value of 0.002428, indicating that the network
estimates there’s a ~0.2 percent chance that the object presented to it is a hot dog.

■ exp(z[1])/total outputs a $\hat{y}_2$ value of 0.01794, indicating an estimated ~1.8
percent chance that it’s a burger.

■ exp(z[2])/total outputs a $\hat{y}_3$ value of 0.9796, for an estimated ~98.0 percent
chance that the object is pizza.

Given this arithmetic, the etymology of the “softmax” name should now be discernible:
The function returns z with the highest value (the max), but it does so softly.
That is, instead of indicating that there’s a 100 percent chance the object is pizza and a
0 percent chance it’s either of the other two fast food classes (that would be a hard max
function), the network hedges its bets, to an extent, and provides a likelihood that the
object is each of the three classes. This leaves us to make the decision about how much
confidence we would require to accept a neural network’s guess. 7
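
The three steps just described can be wrapped into a single function. This is a minimal sketch in plain Python rather than the notebook's exact code; the function name softmax and the variable names are our own.

```python
from math import exp


def softmax(z):
    """Return the softmax of a list of z values."""
    exps = [exp(z_i) for z_i in z]     # step 1: exponentiate each z
    total = sum(exps)                  # step 2: sum the exponentials
    return [e / total for e in exps]   # step 3: divide each by the sum


z = [-1.0, 1.0, 5.0]  # hot dog, burger, pizza
y_hat = softmax(z)
print([round(p, 4) for p in y_hat])  # [0.0024, 0.0179, 0.9796]

# The most likely class is the index of the largest probability (an argmax).
best = max(range(len(y_hat)), key=lambda i: y_hat[i])
print(best)  # 2, the index of the pizza class
```

The three outputs always sum to 1, which is what lets us read them as probabilities. Production implementations commonly subtract the largest z value before exponentiating to avoid overflow; that refinement is omitted here for clarity.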


The sigmoid neuron can itself be viewed as a special case of softmax: a two-class softmax
in which one of the two z values is held at 0 is mathematically equivalent to a single
sigmoid neuron.
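
One concrete way to see the relationship between sigmoid and softmax is to compare a sigmoid neuron with a softmax over two z values, the second of which is pinned at 0. The sketch below assumes that reading of the equivalence; both functions are written from their definitions and the names are our own.

```python
from math import exp


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + exp(-z))


def softmax(z):
    """Softmax over a list of z values."""
    exps = [exp(z_i) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]


for z in (-2.0, 0.0, 3.0):
    two_class = softmax([z, 0.0])[0]  # second z value held at 0
    print(round(sigmoid(z), 6), round(two_class, 6))
```

Each printed pair matches because e^z / (e^z + e^0) simplifies to 1 / (1 + e^(-z)).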


Revisiting Our Shallow Network 

With the knowledge of dense networks that you’ve developed over the course of this 
chapter, we can return to our Shallow Net in Keras notebook and understand the model 
summary within it. Example 5.6 shows the three lines of Keras code we use to architect a 
shallow neural network for classifying MNIST digits. As detailed in Chapter 5, over those 
three lines of code we instantiate a model object and add layers of artificial neurons to it. 
By calling the summary() method on the model, we see the model-summarizing table
provided in Figure 7.5. The table has three columns: 

■ Layer (type): the name and type of each of our layers

■ Output Shape: the dimensionality of the layer

■ Param #: the number of parameters (weights w and biases b) associated with the
layer


7. Confidence thresholds may vary based on your particular application, but typically we’d simply accept whichever
class has the highest likelihood. This class can, for example, be identified with an argmax() (argument maximum)
function, such as the one provided by NumPy, which returns the index position (i.e., the class label) of the largest value.

Layer (type)          Output Shape     Param #
dense_1 (Dense)       (None, 64)       50240
dense_2 (Dense)       (None, 10)       650

Total params: 50,890
Trainable params: 50,890
Non-trainable params: 0


Figure 7.5 A summary of the model object from our Shallow Net in Keras Jupyter notebook


The input layer performs no calculations and never has any of its own parameters, so no 
information on it is displayed directly. The first row in the table, therefore, corresponds to 
the first hidden layer of the network. The table indicates that this layer: 

■ Is called dense_1; this is a default name because we did not designate one explicitly 

■ Is a Dense layer, as we specified in Example 5.6 

■ Is composed of 64 neurons, as we further specified in Example 5.6 

■ Has 50,240 parameters associated with it, broken down into the following: 

■ 50,176 weights, corresponding to each of the 64 neurons in this dense layer
receiving input from each of the 784 neurons in the input layer (64 × 784)

■ Plus 64 biases, one for each of the neurons in the layer

■ Giving us a total of 50,240 parameters:

$n_{parameters} = n_w + n_b = 50176 + 64 = 50240$

The second row of the table in Figure 7.5 corresponds to the model’s output layer.
The table tells us that this layer:

■ Is called dense_2 

■ Is a Dense layer, as we specified it to be 

■ Consists of 10 neurons—again, as we specified 

■ Has 650 parameters associated with it, as follows: 

■ 640 weights, corresponding to each of the 10 neurons receiving input from
each of the 64 neurons in the hidden layer (64 × 10)

■ Plus 10 biases, one for each of the output neurons 

From the parameter counts for each layer, we can calculate for ourselves the Total 
params line displayed in Figure 7.5: 


$$
\begin{aligned}
n_{total} &= n_1 + n_2 \\
          &= 50240 + 650 \\
          &= 50890
\end{aligned}
\tag{7.8}
$$

All 50,890 of these parameters are Trainable params because, during the subsequent
model.fit() call in the Shallow Net in Keras notebook, they are permitted to be tuned
during model training. This is the norm, but as you’ll see in Part III, there are situations
when it is fruitful to freeze some of the parameters in a model, rendering them
Non-trainable params.
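
The parameter counts in Figure 7.5 can be reproduced without Keras. The sketch below uses a helper of our own naming, dense_params, and simply applies the rule that a dense layer has one weight per input per neuron plus one bias per neuron.

```python
def dense_params(n_inputs, n_neurons):
    """Weights (one per input per neuron) plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


# 784 input pixels, 64 hidden neurons, 10 output neurons
layer_sizes = [784, 64, 10]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
print(per_layer)       # [50240, 650]
print(sum(per_layer))  # 50890
```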

Summary 

In this chapter, we detailed how artificial neurons are networked together to approximate 
an output y given some inputs x. In the remaining chapters of Part II, we detail how
a network learns to improve its approximations of y by using training data to tune the 
parameters of its constituent artificial neurons. Simultaneously, we broaden our cover- 
age of best practices for designing and training artificial neural networks so that you can 
include additional hidden layers and form a high-caliber deep learning model. 


Key Concepts 

Here are the essential foundational concepts thus far. New terms from the current
chapter are highlighted in purple. 


■ parameters: 

■ weight w 

■ bias b 

■ activation a 

■ artificial neurons: 

■ sigmoid 

■ tanh 

■ ReLU 


■ input layer 

■ hidden layer 

■ output layer 

■ layer types: 

■ dense (fully connected) 

■ softmax 

■ forward propagation 




Training Deep Networks 


In the preceding chapters, we described artificial neurons comprehensively and we
walked through the process of forward propagating information through a network of 
neurons to output a prediction, such as whether a given fast food item is a hot dog, a 
juicy burger, or a greasy slice of pizza. In those culinary examples from Chapters 6 and 
7, we fabricated numbers for the neuron parameters—the neuron weights and biases. In 
real-world applications, however, these parameters are not typically concocted arbitrarily: 
They are learned by training the network on data. 

In this chapter, you will become acquainted with two techniques, called gradient
descent and backpropagation, that work in tandem to learn artificial neural network
parameters. As usual in this book, our presentation of these methods is not only theoretical: We
provide pragmatic best practices for implementing the techniques. The chapter culminates 
in the application of these practices to the construction of a neural network with more 
than one hidden layer. 

Cost Functions 


In Chapter 7, you discovered that, upon forward propagating some input values all the
way through an artificial neural network, the network provides its estimated output,
which is denoted $\hat{y}$. If a network were perfectly calibrated, it would output $\hat{y}$ values that
are exactly equal to the true label y. In our binary classifier for detecting hot dogs, for
example (Figure 7.3), y = 1 indicated that the object presented to the network is a hot
dog, while y = 0 indicated that it’s something else. In an instance where we have in fact
presented a hot dog to the network, therefore, ideally it would output $\hat{y}$ = 1.

In practice, the gold standard of $\hat{y}$ = y is not always attained and so may be an excessively
stringent definition of the “correct” $\hat{y}$. Instead, if y = 1 we might be quite pleased
to see a $\hat{y}$ of, say, 0.9997, because that would indicate that the network has an extremely
high confidence that the object is a hot dog. A $\hat{y}$ of 0.9 might be considered acceptable,
$\hat{y}$ = 0.6 to be disappointing, and $\hat{y}$ = 0.1192 (as computed in Equation 7.7) to be awful.

To quantify the spectrum of output-evaluation sentiments from “quite pleased” all the
way down to “awful,” machine learning algorithms often involve cost functions (also known
as loss functions). The two such functions that we cover in this book are called quadratic
cost and cross-entropy cost. Let’s cover them in turn.

Quadratic Cost

Quadratic cost is one of the simplest cost functions to calculate. It is alternatively called
mean squared error, which handily describes all that there is to its calculation:

$$
C = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{8.1}
$$

For any given instance i, we calculate the difference (the error) between the true label $y_i$
and the network’s estimated $\hat{y}_i$. We then square this difference, for two reasons:

1. Squaring ensures that whether y is greater than $\hat{y}$ or vice versa, the difference
between the two is stated as a positive value.

2. Squaring penalizes large differences between y and $\hat{y}$ much more severely than
small differences.

Having obtained a squared error for each instance i by using $(y_i - \hat{y}_i)^2$, we can then
calculate the mean cost C across all n of our instances by:

1. Summing up cost across all instances using $\sum_{i=1}^{n}$

2. Dividing by however many instances we have using $\frac{1}{n}$

By taking a peek inside the Quadratic Cost Jupyter notebook from the book’s GitHub
repo, you can play around with Equation 8.1 yourself. At the top of the notebook, we
define a function to calculate the squared error for an instance i:

def squared_error(y, yhat):
    return (y - yhat)**2

By plugging a true y of 1 and the ideal yhat of 1 in to the function by using
squared_error(1, 1), we observe that, as desired, this perfect estimate is associated
with a cost of 0. Likewise, minor deviations from the ideal, such as a yhat of 0.9997,
correspond to an extremely small cost: 9.0e-08. 1 As the difference between y and yhat
increases, we witness the expected quadratic increase in cost: Holding y steady at 1 but
lowering yhat from 0.9 to 0.6, and then to 0.1192, the cost climbs increasingly rapidly
from 0.01 to 0.16 and then to 0.78. As a final bit of amusement in the notebook, we
note that had y truly been 0, our yhat of 0.1192 would be associated with a small cost:
0.0142.
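
The per-instance costs quoted above, along with the averaging step of Equation 8.1, can be sketched as follows. The squared_error function mirrors the one from the notebook; quadratic_cost is a helper of our own, added only to show the mean over several instances.

```python
def squared_error(y, yhat):
    """Squared error for a single instance."""
    return (y - yhat) ** 2


def quadratic_cost(ys, yhats):
    """Mean squared error across all instances (Equation 8.1)."""
    return sum(squared_error(y, yhat) for y, yhat in zip(ys, yhats)) / len(ys)


# Hold the true label y at 1 and let the estimate drift away from it.
for yhat in (1.0, 0.9997, 0.9, 0.6, 0.1192):
    print(yhat, squared_error(1, yhat))

# Had the true label been 0, the same estimate of 0.1192 costs very little.
print(squared_error(0, 0.1192))

# Mean cost over the four imperfect estimates, all with a true y of 1.
mean_cost = quadratic_cost([1, 1, 1, 1], [0.9997, 0.9, 0.6, 0.1192])
print(mean_cost)
```

Because the error is squared, halving the error cuts the cost to a quarter; that is the sense in which large misses are penalized much more heavily than small ones.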

Saturated Neurons 

While quadratic cost serves as a straightforward introduction to loss functions, it has a 
vital flaw. Consider Figure 8.1, in which we recapitulate the tanh activation function 
from Figure 6.10. The issue presented in the figure, called neuron saturation, is common
across all activation functions, but we’ll use tanh as our lone exemplar. A neuron is


1. 9.0e-08 is equivalent to $9.0 \times 10^{-8}$.

Figure 8.1 Plot reproducing the tanh activation function shown in Figure 6.10, drawing
attention to the high and low values of z at which a neuron is saturated


considered saturated when the combination of its inputs and parameters (interacting as per
“the most important equation,” $z = w \cdot x + b$, which is captured in Figure 6.10) produces
extreme values of z: the areas encircled with red in the plot in Figure 8.1. In these areas,
changes in z (via adjustments to the neuron’s underlying parameters w and b) cause only
teensy-weensy changes in the neuron’s activation a. 2

Using methods that we cover later in this chapter (namely, gradient descent and
backpropagation), a neural network is able to learn to approximate y through the tuning
of the parameters w and b associated with all of its constituent neurons. In a saturated
neuron, where changes to w and b lead to only minuscule changes in a, this learning
slows to a crawl: If adjustments to w and b make no discernible impact on a given neuron’s
activation a, then these adjustments cannot have any discernible impact downstream
(via forward propagation) on the network’s $\hat{y}$, its estimate of y.
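
Saturation can be made concrete by looking at the slope of tanh, which is 1 - tanh(z)^2 (a standard calculus result, not something specific to this book). The sketch below prints the activation and its slope at a few z values; the helper name tanh_slope is our own.

```python
from math import tanh


def tanh_slope(z):
    """Derivative of tanh at z: 1 - tanh(z)^2."""
    return 1.0 - tanh(z) ** 2


for z in (0.0, 2.0, 5.0, -5.0):
    print(z, round(tanh(z), 4), round(tanh_slope(z), 6))

# Nudging z in a saturated region barely moves the activation,
# whereas the same nudge near z = 0 moves it noticeably.
print(tanh(5.1) - tanh(5.0))
print(tanh(0.1) - tanh(0.0))
```

At z = 0 the slope is 1, whereas at z = 5 or z = -5 it is on the order of 0.0002, so the same adjustment to w or b changes the activation thousands of times less.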

Cross-Entropy Cost 

One of the ways 3 to minimize the impact of saturated neurons on learning speed is to 
use cross-entropy cost in lieu of quadratic cost. This alternative loss function is configured 
to enable efficient learning anywhere within the activation function curve of Figure 8.1. 
Because of this, it is a far more popular choice of cost function and it is the selection that 
predominates the remainder of this book. 4 

You need not preoccupy yourself with the equation for cross-entropy cost, but for the
sake of completeness, here it is:

$$
C = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \ln \hat{y}_i + (1 - y_i) \ln(1 - \hat{y}_i) \right] \tag{8.2}
$$


2. Recall from Chapter 6 that $a = \sigma(z)$, where $\sigma$ is some activation function (in this example, the tanh function).

3. More methods for attenuating saturated neurons and their negative effects on a network are covered in Chapter 9. 

4. Cross-entropy cost is well suited to neural networks solving classification problems, and such problems dominate
this book. For regression problems (covered in Chapter 9), quadratic cost is a better option than cross-entropy cost.

The most pertinent aspects of the equation are:

■ Like quadratic cost, divergence of $\hat{y}$ from y corresponds to increased cost.

■ Analogous to the use of the square in quadratic cost, the use of the natural logarithm
ln in cross-entropy cost causes larger differences between $\hat{y}$ and y to be
associated with disproportionately larger cost.

■ Cross-entropy cost is structured so that the larger the difference between $\hat{y}$ and y,
the faster the neuron is able to learn. 5
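
To illustrate the last point numerically, the sketch below compares how strongly each cost reacts to a change in z for a single sigmoid output neuron. It assumes a sigmoid activation and relies on two standard calculus results: for quadratic cost, dC/dz = 2(a - y)a(1 - a); for cross-entropy cost, dC/dz = a - y. The function names are our own.

```python
from math import exp


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + exp(-z))


def quadratic_grad(y, z):
    """dC/dz for C = (y - a)^2 with a = sigmoid(z)."""
    a = sigmoid(z)
    return 2.0 * (a - y) * a * (1.0 - a)


def cross_entropy_grad(y, z):
    """dC/dz for C = -[y ln(a) + (1 - y) ln(1 - a)] with a = sigmoid(z)."""
    return sigmoid(z) - y


y = 1  # the true label
for z in (2.0, 0.0, -2.0, -5.0):
    print(z, round(quadratic_grad(y, z), 4), round(cross_entropy_grad(y, z), 4))
```

As z moves toward -5 the estimate gets worse, yet the quadratic-cost gradient shrinks because the neuron is saturating. The cross-entropy gradient keeps growing in magnitude, so the worse the estimate, the larger the corrective signal.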

To make it easier to remember that the greater the cost, the more quickly a neural
network incorporating cross-entropy cost learns, here’s an analogy that would absolutely
never involve any of your esteemed authors: Let’s say you’re at a cocktail party leading the
conversation of a group of people that you’ve met that evening. The strong martini you’re
holding has already gone to your head, and you go out on a limb by throwing a risqué
line into your otherwise charming repartee. Your audience reacts with immediate, visible
disgust. With this response clearly indicating that your quip was well off the mark, you
learn pretty darn quickly. It’s exceedingly unlikely you’ll be repeating the joke anytime
soon.

Anyway, that’s plenty enough on disasters of social etiquette. The final item to note on
cross-entropy cost is that, by including $\hat{y}$, the formula provided in Equation 8.2 applies
to only the output layer. Recall from Chapter 7 (specifically the discussion of Figure 7.3)
that $\hat{y}$ is a special case of a: It’s actually just another plain old a value, except that it’s
being calculated by neurons in the output layer of a neural network. With this in mind,
Equation 8.2 could be expressed with $a_i$ substituted in for $\hat{y}_i$ so that the equation generalizes
neatly beyond the output layer to neurons in any layer of a network:

$$
C = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \ln a_i + (1 - y_i) \ln(1 - a_i) \right] \tag{8.3}
$$

To cement all of this theoretical chatter about cross-entropy cost, let’s interactively
explore our aptly named Cross Entropy Cost Jupyter notebook. There is only one dependency
in the notebook: the log function from the NumPy package, which enables us to
compute the natural logarithm ln shown twice in Equation 8.3. We load this dependency
using from numpy import log.

Next, we define a function for calculating cross-entropy cost for an instance i: 

from numpy import log

def cross_entropy(y, a):
    # Cost for a single instance: y is the true label, a is the neuron's activation
    return -1*(y*log(a) + (1-y)*log(1-a))



5. To understand how the cross-entropy cost function in Equation 8.2 enables a neuron with larger cost to learn
more rapidly, we require a touch of partial-derivative calculus. (Because we endeavor to minimize the use of
advanced mathematics in this book, we’ve relegated this calculus-focused explanation to this footnote.) Central to
the two computational methods that enable neural networks to learn (gradient descent and backpropagation) is
the comparison of the rate of change of cost C relative to neuron parameters like weight w. Using partial-derivative
notation, we can represent these relative rates of change as ∂C/∂w. The cross-entropy cost function is deliberately
structured so that, when we calculate its derivative, ∂C/∂w is related to (ŷ − y). Thus, the larger the difference
between the ideal output y and the neuron’s estimated output ŷ, the greater the rate of change of cost C with
respect to weight w.
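The claim in this footnote can be checked numerically. The sketch below is not from the book's notebooks; it assumes a single sigmoid neuron with z = wx + b (the variable names and the chosen values of w are illustrative). For that setup the derivative of cross-entropy cost with respect to w works out to (ŷ − y)x, whereas the derivative of quadratic cost carries an extra ŷ(1 − ŷ) factor that shrinks toward zero when the neuron saturates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, a):
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def quadratic(y, a):
    return (a - y) ** 2

def numerical_slope(cost, w, x, b, y, eps=1e-6):
    # Central-difference estimate of dC/dw for a single sigmoid neuron
    cost_plus = cost(y, sigmoid((w + eps) * x + b))
    cost_minus = cost(y, sigmoid((w - eps) * x + b))
    return (cost_plus - cost_minus) / (2 * eps)

# Assumed toy setup: one input x, bias b, and a true label y of 1
x, b, y = 1.0, 0.0, 1.0
results = []
for w in (-1.0, -3.0, -6.0):  # increasingly wrong (and increasingly saturated) neuron
    y_hat = sigmoid(w * x + b)
    ce_slope = numerical_slope(cross_entropy, w, x, b, y)
    q_slope = numerical_slope(quadratic, w, x, b, y)
    results.append((w, y_hat, ce_slope, q_slope))
    print(f"w={w:5.1f}  y_hat={y_hat:.4f}  (y_hat - y)*x={(y_hat - y) * x:8.4f}  "
          f"cross-entropy slope={ce_slope:8.4f}  quadratic slope={q_slope:8.4f}")
```

As the neuron's estimate moves farther from y, the cross-entropy slope grows in magnitude while the quadratic slope collapses, which is the behavior the footnote describes.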
Table 8.1 Cross-entropy costs associated with selected example inputs

y    a             C
1    0.9997        0.0003
1    0.9           0.1
1    0.6           0.5
1    0.1192        2.1
0    0.1192        0.1269
1    1 − 0.1192    0.1269

Plugging the same values in to our cross_entropy() function as we did the
squared_error() function earlier in this chapter, we observe comparable behavior. As shown in
Table 8.1, by holding y steady at 1 and gradually decreasing a from the nearly ideal
estimate of 0.9997 downward, we get exponential increases in cross-entropy cost. The table
further illustrates that (again, consistent with the behavior of its quadratic cousin)
cross-entropy cost would be low, with an a of 0.1192, if y happened to in fact be 0. These
results reiterate for us that the chief distinction between the quadratic and cross-entropy
functions is not the particular cost value that they calculate per se, but rather the rate
at which they learn within a neural net, especially if saturated neurons are involved.
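The values in Table 8.1 can be reproduced with a few lines of Python. This is a minimal sketch rather than the notebook's exact code: the `examples` list and the loop are our own scaffolding around the cross_entropy() function defined earlier.

```python
from numpy import log

def cross_entropy(y, a):
    return -1 * (y * log(a) + (1 - y) * log(1 - a))

# (y, a) pairs taken from Table 8.1
examples = [(1, 0.9997), (1, 0.9), (1, 0.6), (1, 0.1192), (0, 0.1192), (1, 1 - 0.1192)]

costs = [cross_entropy(y, a) for y, a in examples]
for (y, a), c in zip(examples, costs):
    print(f"y={y}  a={a:.4f}  C={c:.4f}")
```

The last two rows give the same cost because an activation of 0.1192 is exactly as far from a label of 0 as an activation of 1 − 0.1192 is from a label of 1.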

Optimization: Learning to Minimize Cost 

Cost functions provide us with a quantification of how incorrect our model’s estimate of
the ideal y is. This is most helpful because it arms us with a metric we can leverage to
reduce our network’s incorrectness.

As alluded to a couple of times in this chapter, the primary approach for minimizing
cost in deep learning paradigms is to pair an approach called gradient descent with
another one called backpropagation. These approaches are optimizers and they enable
the network to learn. This learning is accomplished by adjusting the model’s parameters
so that its estimated ŷ gradually converges toward the target of y, and thus the cost
decreases. We cover gradient descent first and move on to backpropagation immediately
afterward.

Gradient Descent 

Gradient descent is a handy, efficient tool for adjusting a model’s parameters with the aim
of minimizing cost, particularly if you have a lot of training data available. It is widely
used across the field of machine learning, not only in deep learning.

In Figure 8.2, we use a nimble trilobite in a cartoon to illustrate how gradient descent
works. Along the horizontal axis in each frame is some parameter that we’ve denoted as p.
In an artificial neural network, this parameter would be either a neuron’s weight w or bias
b. In the top frame, the trilobite finds itself on a hill. Its goal is to descend the gradient,
thereby finding the location with the minimum cost, C. But there’s a twist: The trilobite
is blind! It cannot see whether deeper valleys lie far away somewhere, and so it can only
use its cane to investigate the slope of the terrain in its immediate vicinity.

Figure 8.2 A trilobite using gradient descent to find the value of a parameter p
associated with minimal cost, C

The dashed orange line in Figure 8.2 indicates the blind trilobite’s calculation of the
slope at the point where it finds itself. According to that slope line, if the trilobite takes a
step to the left (i.e., to a slightly lower value of p), it would be moving to a location with
smaller cost. On the other hand, if the trilobite takes a step to the right (a slightly higher
value of p), it would be moving to a location with higher cost. Given the trilobite’s desire to
descend the gradient, it chooses to take a step to the left.

By the middle frame, the trilobite has taken several steps to the left. Here again, we 
see it evaluating the slope with the orange line and discovering that, yet again, a step to 
the left will bring it to a location with lower cost, and so it takes another step left. In the 
lower frame, the trilobite has succeeded in making its way to the location—the value of 
the parameter p —corresponding to the minimum cost. From this position, if it were to 
take a step to the left or to the right, cost would go up, so it gleefully remains in place. 
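The trilobite's walk can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than the book's code: the bowl-shaped `cost` function, the starting point, and the `step_scale` factor are all hypothetical. Note one difference from the cartoon: in gradient descent each step is proportional to the local slope, so the steps shrink as the terrain flattens out near the minimum.

```python
def cost(p):
    # Hypothetical bowl-shaped cost curve with its minimum at p = 2
    return (p - 2.0) ** 2 + 1.0

def slope(p, eps=1e-6):
    # Probe the terrain just left and right of p, like the trilobite's cane
    return (cost(p + eps) - cost(p - eps)) / (2 * eps)

p = 5.0            # starting point, to the right of the valley
step_scale = 0.1   # hypothetical scaling applied to each step
path = [p]
for _ in range(100):
    # A positive slope means "downhill is to the left", so we subtract it
    p = p - step_scale * slope(p)
    path.append(p)

print(f"final p = {p:.4f}, cost = {cost(p):.4f}")
```

Once the slope is approximately zero, the update leaves p where it is, which corresponds to the trilobite remaining in place at the bottom of the valley.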

In practice, a deep learning model would not have only one parameter. It is not
uncommon for deep learning networks to have millions of parameters, and some
industrial applications have billions of them. Even our Shallow Net in Keras (one of the
smallest models we build in this book) has 50,890 parameters (see Figure 7.5).

Figure 8.3 A trilobite exploring along two model parameters, p₁ and p₂, in order to
minimize cost via gradient descent. In a mountain-adventure analogy, p₁ and p₂ could be
thought of as latitude and longitude, and altitude represents cost.


Although it’s impossible for the human mind to imagine a billion-dimensional space,
the two-parameter cartoon shown in Figure 8.3 provides a sense of how gradient descent
scales up to minimize cost across multiple parameters simultaneously. Across however
many trainable parameters there are in a model, gradient descent iteratively evaluates
slopes⁶ to identify the adjustments to those parameters that correspond to the steepest
reduction in cost. With two parameters, as in the trilobite cartoon in Figure 8.3, for
example, this procedure can be likened to a blind hike through the mountains, where:

■ Latitude represents one parameter, say p₁.

■ Longitude represents the other parameter, p₂.

■ Altitude represents cost: the lower the altitude, the better!

The trilobite randomly finds itself at a location in the mountains. From that point, it feels
around with its cane to identify the direction of the step it can take that will reduce its
altitude the most. It then takes that single step. Repeating this process many times, the
trilobite may eventually find itself at the latitude and longitude coordinates that
correspond to the lowest-possible altitude (the minimum cost), at which point the trilobite’s
surreal alpine adventure is complete.
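The two-parameter hike can be sketched with NumPy. The cost surface below is a made-up bowl with its lowest point at p₁ = 1 and p₂ = −3, and the step scaling of 0.1 is likewise an assumption chosen for illustration; a real network's cost surface is far less tidy.

```python
import numpy as np

def cost(p):
    # Hypothetical "altitude" as a function of latitude p[0] and longitude p[1]
    return (p[0] - 1.0) ** 2 + 2.0 * (p[1] + 3.0) ** 2

def gradient(p):
    # Partial derivatives of the cost with respect to each parameter
    return np.array([2.0 * (p[0] - 1.0), 4.0 * (p[1] + 3.0)])

rng = np.random.default_rng(42)
p = rng.uniform(-10, 10, size=2)   # the trilobite's random starting location
start_cost = cost(p)

for _ in range(200):
    # Step in the direction that reduces altitude the most: against the gradient
    p = p - 0.1 * gradient(p)

print(f"final location = {p}, altitude = {cost(p):.6f} (started at {start_cost:.2f})")
```

Both parameters are adjusted in the same step, which is what lets the procedure scale to models with many more than two parameters.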

Learning Rate 

For conceptual simplicity, in Figure 8.4, let’s return to a blind trilobite navigating a
single-parameter world instead of a two-parameter world. Now let’s imagine that we
have a ray-gun that can shrink or enlarge trilobites. In the middle panel, we’ve used our
ray-gun to make our trilobite very small. The trilobite’s steps will then be
correspondingly small, and so it will take our intrepid little hiker a long time to find its way to the
legendary valley of minimum cost. On the other hand, consider the bottom panel, in
which we’ve used our ray-gun to make the trilobite very large. The situation here is even
worse! The trilobite’s steps will now be so large that it will step right over the valley of
minimum cost, and so it never has any hope of finding it.

6. Using partial-derivative calculus.

Figure 8.4 The learning rate (η) of gradient descent expressed as the size of a trilobite.
The middle panel has a small learning rate, and the bottom panel, a large one.

In gradient descent terminology, step size is referred to as learning rate and denoted
with the Greek letter η (eta, pronounced “ee-ta”). Learning rate is the first of
several model hyperparameters that we cover in this book. In machine learning, including
deep learning, hyperparameters are aspects of the model that we configure before we
begin training the model. So hyperparameters such as η are preset while, in contrast,
parameters (namely, w and b) are learned during training.

Getting your hyperparameters right for a given deep learning model often
requires some trial and error. For the learning rate η, it’s something like the fairy tale of
“Goldilocks and the Three Bears”: Too small and too large are both inadequate, but
there’s a sweet spot in the middle. More specifically, as we portray in Figure 8.4, if η is
too small, then it will take many, many iterations of gradient descent (read: an
unnecessarily long time) to reach the minimal cost. On the other hand, selecting a value for η
that is too large means we might never reach minimal cost at all: The gradient descent
algorithm will act erratically as it jumps right over the parameters associated with minimal
cost.

Coming up in Chapter 9, we have a clever trick waiting for you that will
circumnavigate the need for you to manually select a given neural network’s η hyperparameter. In
the interim, however, here are our rules of thumb on the topic:

■ Begin with a learning rate of about 0.01 or 0.001. 

■ If your model is able to learn (i.e., if cost decreases consistently epoch over epoch)
but training happens very slowly (i.e., each epoch, the cost decreases only a small
amount), then increase your learning rate by an order of magnitude (e.g., from 0.01
to 0.1). If the cost begins to jump up and down erratically epoch over epoch, then
you’ve gone too far, so rein in your learning rate.

■ At the other extreme, if your model is unable to learn, then your learning rate may
be too high. Try decreasing it by orders of magnitude (e.g., from 0.001 to 0.0001)
until cost decreases consistently epoch over epoch. For a visual, interactive way to
get a handle on the erratic behavior of a model when its learning rate is too high,
you can return to the TensorFlow Playground example from Figure 1.18 and dial
up the value within the “Learning rate” dropdown box.
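The Goldilocks behavior of the learning rate can be seen on a toy problem. In this sketch the cost curve, the starting point, and the three η values are all assumptions picked to make the contrast obvious; they are not recommendations for a real network.

```python
def cost(p):
    # Hypothetical single-parameter cost curve with its minimum at p = 2
    return (p - 2.0) ** 2

def slope(p):
    return 2.0 * (p - 2.0)

def descend(eta, p=5.0, steps=50):
    # Run a fixed number of gradient descent steps and report the final cost
    for _ in range(steps):
        p = p - eta * slope(p)
    return cost(p)

final_costs = {eta: descend(eta) for eta in (0.001, 0.1, 1.1)}
for eta, c in final_costs.items():
    print(f"eta={eta:<6} cost after 50 steps = {c:.3g}")
```

With the smallest η the cost has barely moved after 50 steps, with the middle value it is essentially zero, and with the largest value each step overshoots the valley by more than the last, so the cost ends up higher than where it started.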


Batch Size and Stochastic Gradient Descent 

When we introduced gradient descent, we suggested that it is efficient for machine
learning problems that involve a large dataset. In the strictest sense, we outright lied to
you. The truth is that if we have a very large quantity of training data, ordinary gradient
descent would not work at all because it wouldn’t be possible to fit all of the data into the
memory (RAM) of our machine.

Memory isn’t the only potential snag; compute power could cause us headaches, too.
A relatively large dataset might squeeze into the memory of our machine, but if we tried
to train a neural network containing millions of parameters with all those data, vanilla
gradient descent would be highly inefficient because of the computational complexity of
the associated high-volume, high-dimensional calculations.

Thankfully, there’s a solution to these memory and compute limitations: the stochastic
variant of gradient descent. With this variation, we split our training data into
mini-batches (small subsets of our full training dataset) to render gradient descent both
manageable and productive.

Although we didn’t focus on it at the time, when we trained the model in our
Shallow Net in Keras notebook back in Chapter 5 we were already using stochastic gradient
descent by setting our optimizer to SGD in the model.compile() step. Further, in the
subsequent line of code when we called the model.fit() method, we set batch_size to
128 to specify the size of our mini-batches: the number of training data points that we
use for a given iteration of SGD. Like the learning rate η presented earlier in this chapter,
batch size is also a model hyperparameter.

Let’s work through some numbers to make the concepts of batches and stochastic
gradient descent more tangible. In the MNIST dataset, there are 60,000 training images.
With a batch size of 128 images, we then have ⌈468.75⌉ = 469 batches⁷,⁸ of gradient
descent per epoch:

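$$
\begin{aligned}
\text{number of batches} &= \left\lceil \frac{\text{size of training dataset}}{\text{batch size}} \right\rceil \\
&= \left\lceil \frac{60{,}000 \text{ images}}{128 \text{ images}} \right\rceil \\
&= \lceil 468.75 \rceil \\
&= 469
\end{aligned}
\tag{8.4}
$$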
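The same arithmetic can be checked in Python with the standard library's `math.ceil`. The variable names are our own; the numbers are those of the MNIST example above.

```python
import math

n_train = 60000     # MNIST training images
batch_size = 128

# Integer-value ceiling, as in Equation 8.4
n_batches = math.ceil(n_train / batch_size)

# Every batch is full except possibly the last one
last_batch_size = n_train - (n_batches - 1) * batch_size

print(n_batches, last_batch_size)
```

The final batch holds whatever is left over after the 468 full batches have been drawn.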

Before carrying out any training, we initialize our network with random values for each
neuron’s parameters w and b.⁹ To begin the first epoch of training:

1. We shuffle and divide the training images into mini-batches of 128 images each.
These 128 MNIST images provide 784 pixels each, which all together constitute
the inputs x that are passed into our neural network. It’s this shuffling step that puts
the stochastic (which means random) in “stochastic gradient descent.”

2. By forward propagation, information about the 128 images is processed by the
network, layer through layer, until the output layer ultimately produces ŷ values.

3. A cost function (e.g., cross-entropy cost) evaluates the network’s ŷ values against the
true y values, providing a cost C for this particular mini-batch of 128 images.

4. To minimize cost and thereby improve the network’s estimates of y given x, the
gradient descent part of stochastic gradient descent is performed: Every single w
and b parameter in the network is adjusted proportional to how much each
contributed to the error (i.e., the cost) in this batch (note that the adjustments are scaled
by the learning rate hyperparameter η).¹⁰

These four steps constitute a round of training, as summarized by Figure 8.5. 

Figure 8.6 captures how rounds of training are repeated until we run out of training
images to sample. The sampling in step 1 is done without replacement, meaning that at the
end of an epoch each image has been seen by the algorithm only once, and yet between
different epochs the mini-batches are sampled randomly. After a total of 468 rounds, the
final batch contains only 96 samples.

This marks the end of the first epoch of training. Assuming we’ve set our model up
to train for further epochs, we begin the next epoch by replenishing our pool with all
60,000 training images. As we did through the previous epoch, we then proceed through
a further 469 rounds of stochastic gradient descent.¹¹ Training continues in this way until
the total desired number of epochs is reached.
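The rounds and epochs described above can be sketched as a pair of nested loops. To keep the example self-contained it trains a single sigmoid neuron on a small synthetic dataset rather than a Keras network on MNIST; the dataset, the learning rate of 0.5, and the 20 epochs are all assumptions. Because there is only one neuron, its gradient has a closed form and no backpropagation through layers is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 1,000 points with 2 features; label is 1 when the features sum above 0
n, batch_size, epochs, eta = 1000, 128, 20, 0.5
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Random initialization of the neuron's parameters
w = rng.normal(scale=0.1, size=2)
b = 0.0

def forward(X, w, b):
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def cross_entropy(y, a, eps=1e-12):
    a = np.clip(a, eps, 1 - eps)   # guard against log(0)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

rounds_per_epoch = len(range(0, n, batch_size))
history = []
for epoch in range(epochs):
    order = rng.permutation(n)                    # step 1: shuffle, then sample without replacement
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]     # the final mini-batch may be smaller
        xb, yb = X[idx], y[idx]
        a = forward(xb, w, b)                     # step 2: forward propagate to get estimates
        batch_cost = cross_entropy(yb, a)         # step 3: cost for this mini-batch
        grad_w = xb.T @ (a - yb) / len(idx)       # step 4: descend the gradient,
        grad_b = np.mean(a - yb)                  #         scaled by the learning rate
        w -= eta * grad_w
        b -= eta * grad_b
    history.append(cross_entropy(y, forward(X, w, b)))

accuracy = np.mean((forward(X, w, b) > 0.5) == (y == 1))
print(f"rounds per epoch: {rounds_per_epoch}, final cost: {history[-1]:.4f}, accuracy: {accuracy:.3f}")
```

Each pass through the outer loop is one epoch, and each pass through the inner loop is one round of training on a freshly sampled mini-batch.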




7. Because 60,000 is not perfectly divisible by 128, that 469th batch would contain only 0.75 × 128 = 96 images.

8. The square brackets we use here and in Equation 8.4 that appear to be missing the horizontal element from the 
bottom are used to denote the calculation of an integer-value ceiling. The whole-integer ceiling of 468.75, for 
example, is 469. 

9. We delve into the particulars of parameter initialization with random values in Chapter 9. 

10. This error-proportional adjustment is calculated during backpropagation. We haven’t covered backpropagation
explicitly yet, but it’s coming up in the next section, so hang on tight!

11. Because we’re sampling randomly, the order in which we select training images for our 469 mini-batches is 
completely different for every epoch. 
Round of Training:

1. Sample a mini-batch of x values

2. Forward propagate x through network to estimate y with ŷ

3. Calculate cost C by comparing y and ŷ

4. Descend gradient of C to adjust w and b, enabling x to better predict y

Figure 8.5 An individual round of training with stochastic gradient descent. Although
mini-batch size is a hyperparameter that can vary, in this particular case, the mini-batch
consists of 128 MNIST digits, as exemplified by our hike-loving trilobite carrying a small
bag of data.



Figure 8.6 An outline of the overall process for training a neural network with
stochastic gradient descent. The entire dataset is shuffled and split into batches. Each
batch is forward propagated through the network; the output ŷ is compared to the
ground truth y and the cost C is calculated; backpropagation calculates the gradients;
and the model parameters w and b are updated. The next batch (indicated by a dashed
line) is forward propagated, and so on until all of the batches have moved through the
network. Once all the batches have been used, a single epoch is complete and the
process starts again with a reshuffling of the full training dataset.
The total number of epochs that we set our network to train for is yet another
hyperparameter, by the way. This hyperparameter, though, is one of the easiest to get right:

■ If the cost on your validation data is going down epoch over epoch, and if your 
final epoch attained the lowest cost yet, then you can try training for additional 
epochs. 

■ Once the cost on your validation data begins to creep upward, that’s an indicator
that your model has begun to overfit to your training data because you’ve trained for
too many epochs. (We elaborate much more on overfitting in Chapter 9.)

■ There are methods¹² you can use to automatically monitor training and validation
cost and stop training early if things start to go awry. In this way, you could set
the number of epochs to be arbitrarily large and know that training will continue
until the validation cost stops improving, and certainly before the model begins
overfitting!
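The idea behind such early-stopping methods can be sketched in plain Python. The helper below is hypothetical (it is not the Keras callback referenced in the footnote), and both the `patience` setting and the list of validation costs are made up for illustration: training stops once the validation cost has failed to improve for `patience` consecutive epochs.

```python
def early_stopping_epoch(val_costs, patience=2):
    """Return the index of the epoch at which training would be halted."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, cost in enumerate(val_costs):
        if cost < best:
            best = cost                        # new lowest validation cost
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1    # validation cost failed to improve
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_costs) - 1                  # never triggered: train to the end

# Hypothetical validation costs, one per epoch: improving, then creeping upward
val_costs = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_epoch = early_stopping_epoch(val_costs, patience=2)
best_epoch = val_costs.index(min(val_costs))

print(f"lowest validation cost at epoch {best_epoch}; training halted at epoch {stop_epoch}")
```

A small patience value tolerates a noisy epoch or two without letting training run far past the point where validation cost was lowest.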

Escaping the Local Minimum

In all of the examples of gradient descent thus far in the chapter, our hiking trilobite has
encountered no hurdles on its journey toward minimum cost. There are no guarantees
that this would be the case, however. Indeed, such smooth sailing is unusual.

Figure 8.7 shows the mountaineering trilobite exploring the cost of some new model 
that is being used to solve some new problem. With this new problem, the relationship 
between the parameter p and cost C is more complex. To have our neural network esti- 
mate y as accurately as possible, gradient descent needs to identify the parameter values 
associated with the lowest-attainable cost. However, as our trilobite makes its way from 
its random starting point in the top panel, gradient descent leads it to getting trapped in 
a local minimum. As shown in the middle panel, while our intrepid explorer is in the local 
minimum, a step to the left or a step to the right both lead to an increase in cost, and so 
the blind trilobite stays put, completely oblivious of the existence of a deeper valley—the 
global minimum —lying yonder. 

All is not lost, friends, for stochastic gradient descent comes to the rescue here again.
The sampling of mini-batches can have the effect of smoothing out the cost curve, as
exemplified by the dashed curve shown in the bottom panel of Figure 8.7. This
smoothing happens because the estimate is noisier when estimating the gradient from a smaller
mini-batch (versus from the entire dataset). Although the actual gradient in the local
minimum truly is zero, estimates of the gradient from small subsets of the data don’t
provide the complete picture and might give an inaccurate reading, causing our trilobite to
take a step left thinking there is a gradient when there really isn’t one. This noisiness and
inaccuracy is paradoxically a good thing! The incorrect gradient may result in a step that
is large enough for the trilobite to escape the local valley and continue making its way
down the mountain. Thus, by estimating the gradient many times on these mini-batches,
the noise is smoothed out and we are able to avoid local minima. In summary, although
each mini-batch on its own lacks complete information about the cost curve, in the long
run (over a large number of mini-batches) this tends to work to our advantage.
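The noisiness of mini-batch gradient estimates is easy to measure on a toy problem. The sketch below assumes a one-parameter regression with a quadratic cost (not a neural network), evaluated at a point where the full-dataset gradient is close to zero; the dataset size, batch sizes, and number of trials are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y is roughly 3x, and cost = mean((w * x - y) ** 2)
n = 10000
x = rng.normal(size=n)
y = 3.0 * x + rng.normal(scale=1.0, size=n)

def gradient(w, xb, yb):
    # dC/dw estimated from whichever points are passed in
    return np.mean(2.0 * (w * xb - yb) * xb)

w = 3.0                              # close to the bottom of the valley
full_grad = gradient(w, x, y)        # gradient using the entire dataset

def minibatch_grad_std(batch_size, trials=300):
    # Spread of the gradient estimate across many randomly sampled mini-batches
    grads = []
    for _ in range(trials):
        idx = rng.choice(n, size=batch_size, replace=False)
        grads.append(gradient(w, x[idx], y[idx]))
    return float(np.std(grads))

spread = {bs: minibatch_grad_std(bs) for bs in (8, 128, 2048)}
print(f"full-dataset gradient: {full_grad:.4f}")
for bs, s in spread.items():
    print(f"batch size {bs:>4}: std of gradient estimate = {s:.4f}")
```

Even though the true gradient here is nearly zero, small mini-batches regularly report a sizable slope in one direction or the other, which is the kind of inaccurate reading that can nudge the trilobite out of a shallow valley.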


12. See keras.io/callbacks/#earlystopping.
Figure 8.7 A trilobite applying vanilla gradient descent from a random starting point
(top panel) is ensnared by a local minimum of cost (middle panel). By turning to
stochastic gradient descent in the bottom panel, the daring trilobite is able to bypass the
local minimum and make its way toward the global minimum.

Like the learning rate hyperparameter η, there is also a Goldilocks-style sweet spot for
batch size. If the batch size is too large, the estimate of the gradient of the cost function is
far more accurate. In this way, the trilobite has a more exacting impression of the gradient
in its immediate vicinity and is able to take a step (proportional to η) in the direction of
the steepest possible descent. However, the model is at risk of becoming trapped in local
minima as described in the preceding paragraph.¹³ Besides that, the model might not fit
in memory on your machine, and the compute time per iteration of gradient descent
could be very long.

13. It’s worth noting that the learning rate η plays a role here. If the size of the local minimum was smaller than
the step size, the trilobite would likely breeze right past the local minimum, akin to how we step over cracks in
the sidewalk.

On the other hand, if the batch size is too small, each gradient estimate may be
excessively noisy (because a very small subset of the data is being used to estimate the gradient
of the entire dataset) and the corresponding path down the mountain will be
unnecessarily circuitous; training will take longer because of these erratic gradient descent steps.
Furthermore, you’re not taking advantage of the memory and compute resources on your
machine.¹⁴ With that in mind, here are our rules of thumb for finding the batch-size
sweet spot:

■ Start with a batch size of 32. 

■ If the mini-batch is too large to fit into memory on your machine, try decreasing 
your batch size by powers of 2 (e.g., from 32 to 16). 

■ If your model trains well (i.e., cost is going down consistently) but each epoch
is taking very long and you are aware that you have RAM to spare,¹⁵ you could
experiment with increasing your batch size. To avoid getting trapped in local
minima, we don’t recommend going beyond 128.

Backpropagation 

Although stochastic gradient descent operates well on its own to adjust parameters and 
minimize cost in many types of machine learning models, for deep learning models in 
particular there is an extra hurdle: We need to be able to efficiently adjust parameters 
through multiple layers of artificial neurons. To do this, stochastic gradient descent is part- 
nered up with a technique called backpropagation. 

Backpropagation (or backprop for short) is an elegant application of the “chain rule”
from calculus.¹⁶ As shown along the bottom of Figure 8.6 and as suggested by its very
name, backpropagation courses through a neural network in the opposite direction of
forward propagation. Whereas forward propagation carries information about the input
x through successive layers of neurons to approximate y with ŷ, backpropagation carries
information about the cost C backwards through the layers in reverse order and, with the
overarching aim of reducing cost, adjusts neuron parameters throughout the network.

Although the nitty-gritty of backpropagation has been relegated to Appendix B, it’s
worth understanding (in broad strokes) what the backpropagation algorithm does: Any
given neural network model is randomly initialized with parameter (w and b) values
(such initialization is detailed in Chapter 9). Thus, prior to any training, when the first
x value is fed in, the network outputs a random guess at ŷ. This is unlikely to be a good
guess, and the cost associated with this random guess will probably be high. At this point,
we need to update the weights in order to minimize the cost, which is the very essence of
machine learning. To do this within a neural network, we use backpropagation to calculate
the gradient of the cost function with respect to each weight in the network.


14. Stochastic gradient descent with a batch size of 1 is known as online learning. It’s worth noting that this is
not the fastest method in terms of compute. The matrix multiplication associated with each round of mini-batch
training is highly optimized, and so training can be several orders of magnitude quicker when using moderately
sized mini-batches relative to online learning.

15. On a Unix-based operating system, including macOS, RAM usage may be assessed by running the top or 
htop command within a Terminal window. 

16. To elucidate the mathematics underlying backpropagation, a fair bit of partial-derivative calculus is necessary.
While we encourage the development of an in-depth understanding of the beauty of backprop, we also appreciate
that calculus might not be the most appetizing topic for everyone. Thus, we’ve placed our content on backprop
mathematics in Appendix B.
Recall from our mountaineering analogies earlier that the cost function represents a
hiking trail, and our trilobite is trying to reach basecamp. At each step along the way,
the trilobite finds the gradient (or the slope) of the cost function and moves down that
gradient. That movement corresponds to a weight update: By adjusting the weight in
proportion to the cost function’s gradient with respect to that weight, backprop adjusts that
weight in a direction that reduces the cost.

Reflecting back on the “most important equation” from Figure 6.7 (w · x + b), and
remembering that neural networks are stacked with information forward propagating
through their layers, we can grasp that any given weight in the network contributes to
the final ŷ output, and thus the cost C. Using backpropagation, we move layer-by-layer
backwards through the network, starting at the cost in the output layer, and we find the
gradients of every single parameter. A given parameter’s gradient can then be used to
adjust the parameter up or down (by an increment corresponding to the learning rate
η), whichever of the two directions is associated with a reduction in cost.

We appreciate that this is not the lightest section of this book. If there’s only one thing
you take away, let it be this: Backpropagation uses cost to calculate the relative
contribution by every single parameter to the total cost, and then it updates each parameter
accordingly. In this way, the network iteratively reduces cost and, well . . . learns!
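For readers who would like to see the chain rule in action without the full treatment in Appendix B, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the book's implementation: the network is a tiny one (four inputs, three sigmoid hidden neurons, one sigmoid output neuron) with cross-entropy cost, and the input, label, random seed, and learning rate are all made up. The backprop gradient for one weight is compared against a finite-difference estimate, and then a single update is applied.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
x = rng.normal(size=4)        # one hypothetical input with four features
y = 1.0                       # its true label

# Randomly initialized parameters: a hidden layer of 3 neurons, then 1 output neuron
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=3), 0.0

def forward(W1, b1, W2, b2):
    a1 = sigmoid(W1 @ x + b1)                    # hidden-layer activations
    y_hat = sigmoid(W2 @ a1 + b2)                # output-layer activation
    cost = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return a1, y_hat, cost

a1, y_hat, cost = forward(W1, b1, W2, b2)

# Backward pass: apply the chain rule from the cost back toward the input
delta2 = y_hat - y                       # dC/dz at the output (sigmoid + cross-entropy)
grad_W2, grad_b2 = delta2 * a1, delta2
delta1 = delta2 * W2 * a1 * (1 - a1)     # dC/dz at each hidden neuron
grad_W1, grad_b1 = np.outer(delta1, x), delta1

# Sanity check: compare one backprop gradient with a finite-difference estimate
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (forward(W1_plus, b1, W2, b2)[2] - forward(W1_minus, b1, W2, b2)[2]) / (2 * eps)

# One gradient descent update of every parameter, scaled by the learning rate eta
eta = 0.05
new_cost = forward(W1 - eta * grad_W1, b1 - eta * grad_b1,
                   W2 - eta * grad_W2, b2 - eta * grad_b2)[2]

print(f"backprop: {grad_W1[0, 0]:.6f}  finite difference: {numeric:.6f}")
print(f"cost before update: {cost:.4f}  after: {new_cost:.4f}")
```

The hidden-layer gradients reuse `delta2`, the quantity already computed for the output layer, which is what makes working backwards through the layers efficient.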

Tuning Hidden-Layer Count and Neuron Count 

As with learning rate and batch size, the number of hidden layers you add to your neural
network is also a hyperparameter. And as with the previous two hyperparameters, there
is yet again a Goldilocks sweet spot for your network’s count of layers. Throughout this
book, we’ve reiterated that with each additional hidden layer within a deep learning
network, the more abstract the representations that the network can represent. That is the
primary advantage of adding layers.

The disadvantage of adding layers is that backpropagation becomes less effective: As
demonstrated by the plot of learning speed across the layers of a five-hidden-layer
network in Figure 8.8, backprop is able to have its greatest impact on the parameters of the
hidden layer of neurons closest to the output ŷ.¹⁷ The farther a layer is from ŷ, the more
diluted the effect of that layer’s parameters on the overall cost. Thus, the fifth layer, which
is closest to the output ŷ, learns most rapidly because those weights are associated with
larger gradients. In contrast, the third hidden layer, which is several layers away from the
output layer’s cost calculation, learns about an order of magnitude more slowly than the
fifth hidden layer.
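This dilution can be reproduced in a deliberately simplified setting. The sketch below is not the Measuring Speed of Learning notebook; it assumes a chain of five hidden layers with a single sigmoid neuron each, every weight set to 1.0, biases omitted, and one made-up training example. Backpropagating through the chain shows the gradient magnitude shrinking at every layer farther from the output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden = 5
weights = np.ones(n_hidden + 1)   # hypothetical: one weight per layer, all 1.0 (last feeds the output)
x, y = 1.0, 0.0                   # a single made-up training example

# Forward pass: store each layer's activation (activations[0] is the input)
activations = [x]
for w in weights:
    activations.append(sigmoid(w * activations[-1]))
y_hat = activations[-1]

# Backward pass: start at the output and work back toward the input
delta = y_hat - y                 # dC/dz at the output for sigmoid + cross-entropy
grads = np.zeros_like(weights)
for layer in reversed(range(len(weights))):
    grads[layer] = delta * activations[layer]        # dC/dw for this layer's weight
    if layer > 0:
        a = activations[layer]
        delta = delta * weights[layer] * a * (1 - a)  # chain rule through the sigmoid

for layer in range(n_hidden):
    print(f"hidden layer {layer + 1}: |dC/dw| = {abs(grads[layer]):.6f}")
print(f"output layer:   |dC/dw| = {abs(grads[-1]):.6f}")
```

Each step backwards multiplies the gradient by a sigmoid derivative, which is never larger than 0.25, so in this toy chain the layers nearest the input receive far smaller updates than those nearest the output.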

Given the above, our rules of thumb for selecting the number of hidden layers in a 
network are: 

■ The more abstract the ground-truth value y you’d like to estimate with your
network, the more helpful additional hidden layers may be. With that in mind, we
recommend starting off with about two to four hidden layers.


17. If you’re curious as to how we made Figure 8.8, check out our Measuring Speed of Learning Jupyter notebook.
Figure 8.8 The speed of learning over epochs of training for a deep learning network
with five hidden layers. The fifth hidden layer, which is closest to the output ŷ, learns
about an order of magnitude more quickly than the third hidden layer.


■ If reducing the number of layers does not increase the cost you can achieve on
your validation dataset, then do it. Following the problem-solving principle called
Occam’s razor, the simplest network architecture that can provide the desired result is
the best; it will train more quickly and require fewer compute resources.

■ On the other hand, if increasing the number of layers decreases the validation cost,
then you should pile up those layers!

Not only is network depth a model hyperparameter, but the number of neurons in a
given layer is, too. If you have many layers in your network, then there are many layers
you could be fine-tuning your neuron count in. This may seem intimidating at first, but
it’s nothing to be too concerned about: A few too many neurons, and your network will
have a touch more computational complexity than is necessary; a touch too few neurons,
and your network’s accuracy may be held back imperceptibly.

As you build and train more and more deep learning models for more and more
problems, you’ll begin to develop a sense for how many neurons might be appropriate in a
given layer. Depending on the particular data you’re modeling, there may be lots of
low-level features to represent, in which case you might want to have more neurons in the
network’s early layers. If there are lots of higher-level features to represent, then you may
benefit from having additional neurons in its later layers. To determine this empirically,
we generally experiment with the neuron count in a given layer by varying it by powers
of 2. If doubling the number of neurons from 64 to 128 provides an appreciable
improvement in model accuracy, then go for it. Rehashing Occam’s razor, however, consider this:
If halving the number of neurons from 64 to 32 doesn’t detract from model accuracy,
then that’s probably the way to go because you’re reducing your model’s computational
complexity with no apparent negative effects.
An Intermediate Net in Keras


To wrap up this chapter, let’s incorporate the new theory we’ve covered into a neural
network to see if we can outperform our previous Shallow Net in Keras model at classifying
handwritten digits.

The first few stages of our Intermediate Net in Keras Jupyter notebook are identical 
to those of its Shallow Net predecessor. We load the same Keras dependencies, load the 
MNIST dataset in the same way, and preprocess the data in the same way. As shown in 
Example 8.1, the situation begins to get interesting when we design our neural network 
architecture. 

Example 8.1 Keras code to architect an intermediate-depth neural network 

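model = Sequential()
# Two hidden layers of 64 ReLU neurons each; the first receives the 784 input pixels
model.add(Dense(64, activation='relu', input_shape=(784,)))
model.add(Dense(64, activation='relu'))
# Softmax output layer with one neuron per digit class
model.add(Dense(10, activation='softmax'))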

The first line of this code chunk, model = Sequential(), is the same as before (refer
to Example 5.6); this is our instantiation of a neural network model object. It’s in the
second line that we begin to diverge. In it, we specify that we’ll substitute the sigmoid
activation function in the first hidden layer with our most-highly-recommended neuron
from Chapter 6, the relu. Other than this activation function swap, the first hidden layer
remains the same: It still consists of 64 neurons, and the dimensionality of the 784-neuron
input layer is unchanged.

The other significant change in Example 8.1 relative to the shallow architecture of
Example 5.6 is that we specify a second hidden layer of artificial neurons. By calling the
model.add() method, we nearly effortlessly add a second Dense layer of 64 relu neurons,
providing us with the notebook’s namesake: an intermediate-depth neural network.
With a call to model.summary(), you can see from Figure 8.9 that this additional layer
corresponds to an additional 4,160 trainable parameters relative to our shallow architecture
(refer to Figure 7.5). We can break these parameters down into:

■ 4,096 weights, corresponding to each of the 64 neurons in the second hidden 
layer densely receiving input from each of the 64 neurons in the first hidden layer 
(64 x 64 = 4,096) 

■ Plus 64 biases, one for each of the neurons in the second hidden layer 

■ Giving us a total of 4,160 parameters: $n_{parameters} = n_w + n_b = 4,096 + 64 = 4,160$
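
As a quick sanity check on this arithmetic, here is a minimal pure-Python sketch that recomputes the parameter count of every dense layer in the intermediate net. The helper name dense_param_count is our own illustrative choice and is not part of Keras; the layer sizes are taken from Example 8.1.

```python
def dense_param_count(n_inputs, n_neurons):
    """Return (weights, biases, total) for one fully connected layer."""
    n_weights = n_inputs * n_neurons   # one weight per input-neuron pair
    n_biases = n_neurons               # one bias per neuron
    return n_weights, n_biases, n_weights + n_biases

# Layer sizes from Example 8.1: 784 inputs -> 64 -> 64 -> 10
layer_sizes = [784, 64, 64, 10]

totals = []
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_w, n_b, n_total = dense_param_count(n_in, n_out)
    totals.append(n_total)
    print(f"{n_in:>4} -> {n_out:<3} weights={n_w:>6,} biases={n_b:>3} total={n_total:>6,}")

print(f"Total params: {sum(totals):,}")
```

The three per-layer totals should agree with the Param # column that model.summary() reports for the same architecture.
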

In addition to changes to the model architecture, we’ve also made changes to the
parameters we specify when compiling our model, as shown in Example 8.2.

Example 8.2 Keras code to compile our intermediate-depth neural network 

model.compile(loss='categorical_crossentropy',
              optimizer=SGD(lr=0.1),
              metrics=['accuracy'])

Layer (type)          Output Shape     Param #
dense_1 (Dense)       (None, 64)       50240
dense_2 (Dense)       (None, 64)       4160
dense_3 (Dense)       (None, 10)       650

Total params: 55,050
Trainable params: 55,050
Non-trainable params: 0

Figure 8.9 A summary of the model object from our Intermediate Net in Keras Jupyter
notebook


With these lines from Example 8.2, we: 

■ Set our loss function to cross-entropy cost by using loss='categorical_crossentropy'
(in Shallow Net in Keras, we used quadratic cost by using loss='mean_squared_error')

■ Set our cost-minimizing method to stochastic gradient descent by using
optimizer=SGD

■ Specify our SGD learning rate hyperparameter η by setting lr=0.1 (see footnote 18)

■ Indicate that, in addition to the Keras default of providing feedback on loss, by
setting metrics=['accuracy'], we’d also like to receive feedback on model accuracy
(see footnote 19)

Finally, we train our intermediate net by running the code in Example 8.3. 

Example 8.3 Keras code to train our intermediate-depth neural network 

model.fit(X_train, y_train,
          batch_size=128, epochs=20,
          verbose=1,
          validation_data=(X_valid, y_valid))

Relative to the way we trained our shallow net (see Example 5.7), the only change we’ve
made is reducing our epochs hyperparameter from 200 down by an order of magnitude
to 20. As you’ll see, our much-more-efficient intermediate architecture required far fewer
epochs to train.

18. On your own time, you can play around with increasing this learning rate by several orders of magnitude as
well as decreasing it by several orders of magnitude, and observing how it impacts training.

19. Although loss provides the most important metric for tracking a model’s performance epoch over epoch, its
particular values are specific to the characteristics of a given model and are not generally interpretable or comparable
between models. Because of this, other than knowing that we would like our loss to be as close to zero as possible,
it can be an esoteric exercise to interpret how close to zero loss should be for any particular model. Accuracy,
on the other hand, is highly interpretable and highly generalizable: We know exactly what it means (e.g., “The
shallow neural network correctly classified 86 percent of the handwritten digits in the validation dataset”), and
we can compare this classification accuracy to any other model (“The accuracy of 86 percent is worse than the
accuracy of our deep neural network”).

- 1s 15us/step - loss: 0.4744 - acc: 0.8637 - val_loss: 0.2686 - val_acc: 0.9234
- 1s 12us/step - loss: 0.2414 - acc: 0.9289 - val_loss: 0.2004 - val_acc: 0.9404
- 1s 12us/step - loss: 0.1871 - acc: 0.9452 - val_loss: 0.1578 - val_acc: 0.9521
- 1s 12us/step - loss: 0.1538 - acc: 0.9551 - val_loss: 0.1435 - val_acc: 0.9574

Figure 8.10 The performance of our intermediate-depth neural network over its first
four epochs of training

Figure 8.10 provides the results of the first four epochs of training the network.
Recalling that our shallow architecture plateaued as it approached 86 percent accuracy on
the validation dataset after 200 epochs, our intermediate-depth network is clearly superior:
The val_acc field shows that we attained 92.34 percent accuracy after a single epoch
of training. This accuracy climbs to more than 95 percent by the third epoch and appears
to plateau around 97.6 percent by the twentieth. My, how far we’ve come already!

Let’s break down the verbose model.fit() output shown in Figure 8.10 in further
detail:

■ The progress bar shown next fills in over the course of the 469 “rounds of training”
(Figure 8.5):

60000/60000 [==============================]

■ 1s 15us/step indicates that all 469 rounds in the first epoch required 1 second to
train, at an average rate of 15 microseconds per training sample (60,000 samples at
15 microseconds each comes to roughly 0.9 seconds).

■ loss shows the average cost on our training data for the epoch. For the first epoch
this is 0.4744, and, epoch over epoch, this cost is reliably minimized via stochastic
gradient descent (SGD) and backpropagation, eventually diminishing to 0.0332 by
the twentieth epoch.

■ acc is the classification accuracy on training data for the epoch. The model correctly
classified 86.37 percent for the first epoch, increasing to more than 99
percent by the twentieth. Because a model can overfit to the training data, one
shouldn’t be overly impressed by high accuracy on the training data.

■ Thankfully, our cost on the validation dataset (val_loss) does generally decrease
as well, eventually plateauing as it approaches 0.08 over the final five epochs of
training.

■ Corresponding to the decreasing cost of the validation data is an increase in accuracy
(val_acc). As mentioned, validation accuracy plateaued at about 97.6 percent,
which is a vast improvement over the 86 percent of our shallow net.
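
The figure of 469 rounds per epoch follows directly from the dataset and batch sizes. The short sketch below redoes that arithmetic in plain Python; the variable names are our own, and the values are the ones used in Example 8.3.

```python
import math

n_train = 60000      # MNIST training images
batch_size = 128     # as in Example 8.3
epochs = 20          # as in Example 8.3

# The final round of each epoch is a partial batch, hence the ceiling
rounds_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (rounds_per_epoch - 1) * batch_size
total_updates = rounds_per_epoch * epochs

print("rounds per epoch:      ", rounds_per_epoch)
print("samples in final round:", last_batch_size)
print("parameter updates over all epochs:", total_updates)
```

Because 60,000 is not an exact multiple of 128, the last round of every epoch sees fewer samples than the others.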


Summary 

We covered a lot of ground in this chapter. Starting from an appreciation of how a
neural network with fixed parameters processes information, we developed an understanding
of the cooperating methods (cost functions, stochastic gradient descent,
and backpropagation) that enable network parameters to be learned so that we can
approximate any y that has a continuous relationship to some input x. Along the way,
we introduced several network hyperparameters, including learning rate, mini-batch
size, and number of epochs of training, as well as our rules of thumb for configuring
each of these. The chapter concluded by applying your newfound knowledge to develop
an intermediate-depth neural network that greatly outperformed our previous, shallow
network on the same handwritten-digit-classification task. Up next, we have techniques
for improving the stability of artificial neural networks as they deepen, enabling you to
architect and train a bona fide deep learning model for the first time.


Key Concepts 

Here are the essential foundational concepts thus far. New terms from the current
chapter are highlighted in purple.

■ parameters:
    ■ weight w
    ■ bias b
■ activation a
■ artificial neurons:
    ■ sigmoid
    ■ tanh
    ■ ReLU
■ input layer
■ hidden layer
■ output layer
■ layer types:
    ■ dense (fully connected)
    ■ softmax
■ cost (loss) functions:
    ■ quadratic (mean squared error)
    ■ cross-entropy
■ forward propagation
■ backpropagation
■ optimizers:
    ■ stochastic gradient descent
■ optimizer hyperparameters:
    ■ learning rate η
    ■ batch size


Improving Deep Networks 


In Chapter 6, we detailed individual artificial neurons. In Chapter 7, we arranged these
neural units together as the nodes of a network, enabling the forward propagation of some
input x through the network to produce some output ŷ. Most recently, in Chapter 8, we
described how to quantify the inaccuracies of a network (compare ŷ to the true y with a
cost function) as well as how to minimize these inaccuracies (adjust the network parameters
w and b via optimization with stochastic gradient descent and backpropagation).

In this chapter, we cover common barriers to the creation of high-performing neural 
networks and techniques that overcome them. We apply these ideas directly in code while 
architecting our first deep neural network. 1 Combining this additional network depth 
with our newfound best practices, we’ll see if we can outperform the handwritten-digit 
classification accuracy of our simpler, shallower architectures from previous chapters. 

Weight Initialization 

In Chapter 8, we introduced the concept of neuron saturation (see Figure 8.1), where
very low or very high values of z diminish the capacity for a given neuron to learn.
At the time, we offered cross-entropy cost as a solution. Although cross-entropy does
effectively attenuate the effects of neuron saturation, pairing it with thoughtful weight
initialization will reduce the likelihood of saturation occurring in the first place. As
mentioned in a footnote in Chapter 1, modern weight initialization provided a significant
leap forward in deep learning capability: It is one of the landmark theoretical advances
between LeNet-5 (Figure 1.11) and AlexNet (Figure 1.17) that dramatically broadened
the range of problems artificial neural networks could reliably solve. In this section, we
play around with several weight initializations to help you develop an intuition around
how they’re so impactful.

While describing neural network training in Chapter 8, we mentioned that the parameters
w and b are initialized with random values such that a network’s starting approximation
of y will be far off the mark, thereby leading to a high initial cost C. We haven’t
needed to dwell on this much, because, in the background, Keras by default constructs
TensorFlow models that are initialized with sensible values for w and b. It’s nevertheless
worthwhile discussing this initialization, not only to be aware of another method
for avoiding neuron saturation but also to fill in a gap in your understanding of how
neural network training works. Although Keras does a sensible job of choosing default
values (and that’s a key benefit of using Keras in the first place), it’s certainly possible,
and sometimes even necessary, to change these defaults to suit your problem.

1. Recall from Chapter 4 that a neural network earns the deep moniker if it consists of at least three hidden layers.

To make this section interactive, we encourage you to check out our accompanying 
Jupyter notebook, Weight Initialization. As shown in the following chunk of code, our 
library dependencies are NumPy (for numerical operations), matplotlib (for generating 
plots), and a handful of Keras methods, which we will detail as we work through them in 
this section. 

import numpy as np
import matplotlib.pyplot as plt
from keras import Sequential
from keras.layers import Dense, Activation
from keras.initializers import Zeros, RandomNormal
from keras.initializers import glorot_normal, glorot_uniform

In this notebook, we simulate 784 pixel values as inputs to a single dense layer of artificial 
neurons. The inspiration behind our simulation of these 784 inputs comes of course from 
our beloved MNIST digits (Figure 5.3). For the number of neurons in the dense layer 
(256), we picked a number large enough so that, when we make some plots later on, they 
have ample data: 

n_input = 784 
n_dense = 256 

Now for the impetus of this section: the initialization of the network parameters w 
and b. Before we begin passing training data into our network, we’d like to start with 
reasonably scaled parameters. This is for two reasons. 

1. Large w and b values tend to correspond to larger z values and therefore saturated 
neurons (see Figure 8.1 for a plot on neuron saturation). 

2. Large parameter values would imply that the network has a strong opinion about
how x is related to y, but before any training on data has occurred, any such strong
opinions are wholly unmerited.

Parameter values of zero, on the other hand, imply the weakest opinion on how x is
related to y. To bring back the fairy tale yet again, we’re aiming for a Goldilocks-style,
middle-of-the-road approach that starts training from a balanced and learnable beginning.
With that in mind, when we design our neural network architecture, we select the
Zeros() method for initializing the neurons of our dense layer with b = 0:

b_init = Zeros()

Following the line of thinking from the preceding paragraph to its natural conclusion,
we might be tempted to think that we should also initialize our network weights w
with zeros. In fact, this would be a training disaster: If all weights and biases were
identical, many neurons in the network would treat a given input x identically, giving
stochastic gradient descent a minimum of heterogeneity for identifying individual parameter
adjustments that might reduce the cost C. It would be more productive to initialize
weights with a range of different values so that each neuron treats a given x uniquely,
thereby providing SGD with a wide variety of starting points for approximating y. By
chance, some of the initial neuron outputs may contribute in part to a sensible mapping
from x to y. Although this contribution will be weak at first, SGD can experiment with
it to determine whether it might contribute to a reduction in the cost C between the
predicted ŷ and the target y.
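
To make the lack of heterogeneity concrete, here is a minimal NumPy sketch of a single dense sigmoid layer. It is a stand-in for the Keras layer rather than the notebook’s code: the sigmoid helper, the seed, and the standard deviation of 0.05 for the random weights are all illustrative assumptions of ours.

```python
import numpy as np

rng = np.random.default_rng(42)
n_input, n_dense = 784, 256
x = rng.random((1, n_input))              # simulated pixel values in [0, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b = np.zeros(n_dense)                     # biases start at zero in both cases

# Case 1: every weight is zero, so every neuron computes z = 0
w_zeros = np.zeros((n_input, n_dense))
a_zeros = sigmoid(x @ w_zeros + b)

# Case 2: small random weights (std of 0.05 is an arbitrary illustrative value)
w_random = rng.normal(0.0, 0.05, size=(n_input, n_dense))
a_random = sigmoid(x @ w_random + b)

print("distinct activations with zero weights:  ", np.unique(a_zeros).size)
print("distinct activations with random weights:", np.unique(a_random).size)
```

With zero weights every neuron outputs the same activation, so there is nothing to distinguish one neuron from another; random weights give each neuron its own starting response to x.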

As worked through earlier (e.g., in discussion of Figures 7.5 and 8.9), the vast majority
of the parameters in a typical network are weights; relatively few are biases. Thus, it’s
acceptable (indeed, it’s the most common practice) to initialize biases with zeros, and the
weights with a range of values near zero. One straightforward way to generate random
values near zero is to sample from the standard normal distribution 2 as in Example 9.1.

Example 9.1 Weight initialization with values sampled from standard normal
distribution

w_init = RandomNormal(stddev=1.0) 

To observe the impact of the weight initialization we’ve chosen, in Example 9.2 we 
design a neural network architecture for our single dense layer of sigmoid neurons. 

Example 9.2 Architecture for a single dense layer of sigmoid neurons 

model = Sequential()
model.add(Dense(n_dense,
                input_dim=n_input,
                kernel_initializer=w_init,
                bias_initializer=b_init))
model.add(Activation('sigmoid'))

As in all of our previous examples, we instantiate a model by using Sequential(). We
then use the add() method to create a single Dense layer with the following parameters:

■ 256 neurons (n_dense)

■ 784 inputs (n_input)

■ kernel_initializer set to w_init to initialize the network weights via our desired
approach, in this case sampling from the standard normal distribution

■ bias_initializer set to b_init to initialize the biases with zeros


2. The normal distribution is also known as the Gaussian distribution or, colloquially, as the “bell curve” because
of its bell-like shape. The standard normal distribution in particular is a normal distribution with a mean of 0 and
standard deviation of 1.

For simplicity when updating it later in this section, we add the sigmoid activation function
to the layer separately by using Activation('sigmoid').

With our network set up, we use the NumPy random() method to generate 784
“pixel values,” which are floats randomly sampled from the range [0.0, 1.0):

x = np.random.random((1,n_input))

We subsequently use the predict() method to forward propagate x through the single
layer and output the activations a:

a = model.predict(x)

With our final line of code, we use a histogram to visualize the a activations: 3

_ = plt.hist(np.transpose(a))

Your result will look slightly different from ours because of the random() method we
used to generate our input values, but your outputs should look approximately like those
shown in Figure 9.1.

As expected given Figure 6.9, the a activations output from our sigmoid layer
of neurons are constrained to a range from 0 to 1. What is undesirable about these
activations, however, is that they are chiefly pressed up against the extremes of the range:
Most of them are either immediately adjacent to 0 or immediately adjacent to 1. This
indicates that with the normal distribution from which we sampled to initialize the layer’s
weights w, we ended up encouraging our artificial neurons to produce large z values.
This is unwelcome for two reasons mentioned earlier in this section:

1. It means the vast majority of the neurons in the layer are saturated. 

2. It implies that the neurons have strong opinions about how x would influence y 
prior to any training on data. 

Figure 9.1 Histogram of the a activations output by a layer of sigmoid neurons, with
weights initialized using a standard normal distribution



3. In case you’re wondering, the leading underscore (_ =) keeps the Jupyter notebook tidier by outputting the plot
only, instead of the plot as well as an object that stores the plot.

Thankfully, this ickiness can be resolved by initializing our network weights with values
sampled from alternative distributions.

Xavier Glorot Distributions 

In deep learning circles, popular distributions for sampling weight-initialization values
were devised by Xavier Glorot and Yoshua Bengio 4 (portrait provided in Figure 1.10).
These Glorot distributions, as they are typically called, 5 are tailored such that sampling
from them will lead to neurons initially outputting small z values. Let’s examine them
in action. Replacing the standard-normal-sampling code (Example 9.1) of our Weight
Initialization notebook with the line in Example 9.3, we sample from the Glorot normal
distribution 6 instead.

Example 9.3 Weight initialization with values sampled from Glorot normal 
distribution 

w_init = glorot_normal() 

By restarting and rerunning the notebook, 7 you should now observe a distribution of the 
activations a similar to the histogram shown in Figure 9.2. 

Figure 9.2 Histogram of the a activations output by a layer of sigmoid neurons, with
weights initialized using the Glorot normal distribution

4. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks.
Proceedings of Machine Learning Research, 9, 249–56.

5. Some folks also refer to them as Xavier distributions.

6. The Glorot normal distribution is a truncated normal distribution. It is centered at 0 with a standard deviation
of $\sqrt{\frac{2}{n_{in} + n_{out}}}$, where $n_{in}$ is the number of neurons in the preceding layer and $n_{out}$ is the
number of neurons in the subsequent layer.

7. Select Kernel from the Jupyter notebook menu bar and choose Restart & Run All. This ensures you start
completely fresh and don’t reuse old parameters from the previous run.

In stark contrast to Figure 9.1, the a activations obtained from our layer of sigmoid
neurons are now normally distributed with a mean of ~0.5 and few (if any) values at the
extremes of the sigmoid range (i.e., less than 0.1 or greater than 0.9). This is a good
starting point for a neural network because now:

1. Few, if any, of the neurons are saturated.

2. It implies that the neurons generally have weak opinions about how x would influence
y, which, prior to any training on data, is sensible.
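
The contrast between the two histograms can be summarized with a single number: the fraction of activations below 0.1 or above 0.9. The sketch below estimates it with NumPy as a stand-in for the Keras layer. It makes two simplifying assumptions of ours: the Glorot weights are drawn from a plain (untruncated) normal with standard deviation sqrt(2 / (n_in + n_out)), and the helper names and seed are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_input, n_dense = 784, 256
x = rng.random((1, n_input))              # simulated pixel values in [0, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def saturated_fraction(w):
    """Share of sigmoid activations below 0.1 or above 0.9 (biases are zero)."""
    a = sigmoid(x @ w)
    return np.mean((a < 0.1) | (a > 0.9))

# Standard normal weights, as in Example 9.1
w_std_normal = rng.normal(0.0, 1.0, size=(n_input, n_dense))

# Glorot-scaled weights (plain normal used here instead of a truncated normal)
glorot_std = np.sqrt(2.0 / (n_input + n_dense))
w_glorot = rng.normal(0.0, glorot_std, size=(n_input, n_dense))

frac_std = saturated_fraction(w_std_normal)
frac_glorot = saturated_fraction(w_glorot)
print(f"standard normal: {frac_std:.1%} of activations near 0 or 1")
print(f"Glorot scaling:  {frac_glorot:.1%} of activations near 0 or 1")
```

Under these assumptions, the large majority of activations sit at the extremes with standard normal weights, while very few do with Glorot-scaled weights.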


As demonstrated in this section, one of the potentially confusing aspects of weight
initialization is that, if we would like the a values returned by a layer of artificial
neurons to be normally distributed (and we do!), we should not sample our initial
weights from a standard normal distribution.


In addition to the Glorot normal distribution, there is also the Glorot uniform distribution. 8
The impact of selecting one of these Glorot distributions over the other when
initializing your model weights is generally unnoticeable. You’re welcome to rerun the
notebook while sampling values from the Glorot uniform distribution by setting w_init
equal to glorot_uniform(). Your histogram of activations should come out more or less
indistinguishable from Figure 9.2.
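
One way to see why the two variants behave so similarly is that they share the same spread. The NumPy sketch below assumes the usual Glorot definitions (a normal distribution with standard deviation sqrt(2 / (n_in + n_out)), and a uniform distribution on [-l, l] with l = sqrt(6 / (n_in + n_out))) and, for simplicity, ignores the truncation Keras applies to the normal variant. Variable names and the seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out = 784, 256
shape = (n_in, n_out)

glorot_std = np.sqrt(2.0 / (n_in + n_out))     # std of the normal variant
glorot_limit = np.sqrt(6.0 / (n_in + n_out))   # half-width of the uniform variant

w_normal = rng.normal(0.0, glorot_std, size=shape)
w_uniform = rng.uniform(-glorot_limit, glorot_limit, size=shape)

# A uniform distribution on [-l, l] has standard deviation l / sqrt(3)
print("theoretical std, normal: ", glorot_std)
print("theoretical std, uniform:", glorot_limit / np.sqrt(3.0))
print("sampled std, normal:     ", w_normal.std())
print("sampled std, uniform:    ", w_uniform.std())
```

Since both sets of weights are centered at zero with matching standard deviations, the resulting z values have a similar scale whichever variant is chosen.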

By swapping out the sigmoid activation function in Example 9.2 with tanh
(Activation('tanh')) or ReLU (Activation('relu')) in the Weight Initialization
notebook, you can observe the consequences of initializing weights with values sampled
from a standard normal distribution relative to a Glorot distribution across a range of
activations. As shown in Figure 9.3, regardless of the chosen activation function, weight
initialization with the standard normal leads to a activation outputs that are extreme
relative to those obtained when initializing with Glorot.

To be sure you’re aware of the parameter initialization approach used by Keras, you
can delve into the library’s documentation on a layer-by-layer basis, but, as we’ve suggested
here, its default configuration is typically to initialize biases with 0 and to initialize
weights with a Glorot distribution.


Glorot initialization is probably the most popular technique for initializing weights, but
there are other sensible options such as He initialization 9 and LeCun initialization. 10
In our experience, the difference in outcome when selecting between these weight-initialization
techniques is minimal to imperceptible.


8. The Glorot uniform distribution is on the range $[-l, l]$ where $l = \sqrt{\frac{6}{n_{in} + n_{out}}}$.


9. He, K., et al. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
arXiv: 1502.01852.

10. LeCun, Y., et al. (1998). Efficient backprop. In G. Montavon et al. (Eds.) Neural Networks: Tricks of the Trade.
Lecture Notes in Computer Science, 7700 (pp. 355–65). Berlin: Springer.

(a) tanh with standard normal init. (b) ReLU with standard normal init.
(c) tanh with Glorot init. (d) ReLU with Glorot init.

Figure 9.3 The activations output by a dense layer of 256 neurons, while varying
activation function (tanh or ReLU) and weight initialization (standard normal or Glorot
uniform). Note that while the distributions in (b) and (d) appear comparable at first
glance, the standard normal initialization produced large activation values (reaching
toward 40) while all the activations resulting from Glorot initialization are below 2.


Unstable Gradients 


Another issue associated with artificial neural networks, and one that becomes especially 
perturbing as we add more hidden layers, is unstable gradients. Unstable gradients can either 
be vanishing or explosive in nature. We cover both varieties in turn here, and then discuss 
a solution to these issues called batch normalization. 


Vanishing Gradients 

Recall that using the cost C between the network’s predicted ŷ and the true y, as
depicted in Figure 8.6, backpropagation works its way from the output layer toward
the input layer, adjusting network parameters with the aim of minimizing cost. As exemplified
by the mountaineering trilobite in Figure 8.2, each of the parameters is adjusted in
proportion to its gradient with respect to cost: If, for example, the gradient of a parameter
(with respect to the cost) is large and positive, this implies that the parameter contributes a
large amount to the cost, and so decreasing it proportionally would correspond to a decrease
in cost. 11

In the hidden layer that is closest to the output layer, the relationship between its
parameters and cost is the most direct. The farther away a hidden layer is from the output
layer, the more muddled the relationship between its parameters and cost becomes. The
impact of this is that, as we move from the final hidden layer toward the first hidden layer,
the gradient of a given parameter relative to cost tends to flatten; it gradually vanishes. As
a result of this, and as captured by Figure 8.8, the farther a layer is from the output layer,
the more slowly it tends to learn. Because of this vanishing gradient problem, if we were to
naively add more and more hidden layers to our neural network, eventually the hidden
layers farthest from the output would not be able to learn to any extent, crippling the
capacity for the network as a whole to learn to approximate y given x.
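
One simplified way to picture this flattening, which goes a little beyond the description above, is to follow a single path of sigmoid neurons backward through the network: by the chain rule, each layer the gradient passes through multiplies it by that neuron’s slope times a weight. The sketch below assumes sigmoid neurons at their steepest point (z = 0) and a weight of magnitude 1 on every connection; both are illustrative assumptions of ours, and real networks are messier.

```python
import numpy as np

def sigmoid_prime(z):
    """Slope of the sigmoid function at z."""
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

best_case_slope = sigmoid_prime(0.0)   # the steepest a sigmoid ever gets
weight = 1.0                           # assumed weight magnitude along the path

scales = []
for layers_from_output in range(1, 11):
    # Each extra layer multiplies the gradient by (slope * weight)
    scale = (best_case_slope * weight) ** layers_from_output
    scales.append(scale)
    print(f"{layers_from_output:>2} layers from output: gradient scaled by {scale:.2e}")
```

Even in this best case for the sigmoid, the scaling factor shrinks geometrically with distance from the output layer, which is consistent with the slower learning of early layers.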

Exploding Gradients 

Although they occur much less frequently than vanishing gradients, certain network
architectures (e.g., the recurrent nets introduced in Chapter 11) can induce exploding
gradients. In this case, the gradient of a given parameter relative to cost becomes
increasingly steep as we move from the final hidden layer toward the first hidden layer.
As with vanishing gradients, exploding gradients can inhibit an entire neural network’s
capacity to learn by saturating the neurons with extreme values.

Batch Normalization 

During neural network training, the distribution of hidden parameters in a layer may
gradually move around; this is known as internal covariate shift. In fact, this is sort of the
point of training: We want the parameters to change in order to learn things about the
underlying data. But as the distribution of the weights in a layer changes, the inputs to
the next layer might be shifted away from an ideal (i.e., normal, as in Figure 9.2) distribution.
Enter batch normalization (or batch norm for short). 12 Batch norm takes the a
activations output from the preceding layer, subtracts the batch mean, and divides by the
batch standard deviation. This acts to recenter the distribution of the a values with a
mean of 0 and a standard deviation of 1 (see Figure 9.4). Thus, if there are any extreme
values in the preceding layer, they won’t cause exploding or vanishing gradients in the
next layer. In addition, batch norm has the following positive effects:

■ It allows layers to learn more independently from each other, because large values
in one layer won’t excessively influence the calculations in the next layer.

■ It allows for selection of a higher learning rate (because there are no extreme
values in the normalized activations), thus enabling faster learning.

■ The layer outputs are normalized to the batch mean and standard deviation, and
that adds a noise element (especially with smaller batch sizes), which, in turn,
contributes to regularization. (Regularization is covered in the next section, but
suffice it to say here that regularization helps a network generalize to data it hasn’t
encountered previously, which is a good thing.)

11. The change is directly proportional to the negative magnitude of the gradient, scaled by the learning rate η.

12. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal
covariate shift. arXiv: 1502.03167.

Figure 9.4 Batch normalization transforms the distribution of the activations output by
a given layer of neurons toward a standard normal distribution.

Batch normalization adds two extra learnable parameters to any given layer it is applied
to: γ (gamma) and β (beta). In the final step of batch norm, the outputs are linearly
transformed by multiplying by γ and adding β, where γ is analogous to the standard
deviation, and β to the mean. (You may notice this is the exact inverse of the operation
that normalized the output values in the first place!) However, the output values were
originally normalized by the batch mean and batch standard deviation, whereas γ and β
are learned by SGD. We initialize the batch norm layer with γ = 1 and β = 0, and thus
at the start of training this linear transformation makes no changes; batch norm is allowed
to normalize the outputs as intended. As the network learns, though, it may determine
that denormalizing any given layer’s activations is optimal for reducing cost. In this way, if
batch norm is not helpful the network will learn to stop using it on a layer-by-layer basis.
Indeed, because γ and β are continuous variables, the network can decide to what degree
it would like to denormalize the outputs, depending on what works best to minimize the
cost. Pretty neat!
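
The two steps described here (normalize with the batch statistics, then rescale with the learnable pair) can be sketched in a few lines of NumPy. This is a minimal training-time illustration under our own assumptions, not the Keras implementation: the function name batch_norm, the small constant eps added for numerical stability, and the simulated activations are all ours, and the running averages that libraries keep for use at inference time are ignored.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Normalize each column of a over the batch, then scale and shift."""
    mean = a.mean(axis=0)                     # batch mean per neuron
    var = a.var(axis=0)                       # batch variance per neuron
    a_hat = (a - mean) / np.sqrt(var + eps)   # roughly mean 0, std 1
    return gamma * a_hat + beta               # learnable linear transform

rng = np.random.default_rng(7)
# Simulated activations: a batch of 128 samples from a layer of 4 neurons
a = rng.normal(loc=5.0, scale=3.0, size=(128, 4))

# Initial setting: gamma = 1, beta = 0 leaves the normalization untouched
out_default = batch_norm(a, gamma=np.ones(4), beta=np.zeros(4))

# If gamma and beta matched the batch std and mean, normalization would be undone
out_undone = batch_norm(a, gamma=a.std(axis=0), beta=a.mean(axis=0))

print("mean after batch norm:", out_default.mean(axis=0).round(3))
print("std after batch norm: ", out_default.std(axis=0).round(3))
print("largest difference from input when denormalized:", np.abs(out_undone - a).max())
```

The second call illustrates the point about denormalizing: with γ equal to the batch standard deviation and β equal to the batch mean, the layer’s output is essentially its input again.
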

Model Generalization (Avoiding Overfitting)

In Chapter 8, we mention that after training a model for a certain number of epochs the
cost calculated on the validation dataset (which may have been decreasing nicely over
earlier epochs) could begin to increase despite the fact that the cost calculated on the
training dataset is still decreasing. This situation, when training cost continues to go down
while validation cost goes up, is formally known as overfitting.

We illustrate the concept of overfitting in Figure 9.5. Notice we have the same data
points scattered along the x and y axes in each panel. We can imagine that there is some
underlying distribution that describes these points, and we show a sampling from that
distribution. Our goal is to generate a model that explains the relationship between x and y
but, perhaps most importantly, one that also approximates the original distribution; in this
way, the model will be able to generalize to new data points drawn from the distribution
and not just model the sampling of points we already have.

In the first panel (top left) of Figure 9.5, we use a single-parameter model, which is
limited to fitting a straight line to the data. 13 This straight line underfits the data: The cost
(represented by the vertical gaps between the line and the data points) is high, and the
model would not generalize well to new data points. In other words, the line misses most
of the points because this kind of model is not complex enough. In the next panel (top
right), we use a model with two parameters, which fits a parabola-shaped curve to the
data. 14 With this parabolic model, the cost is much lower relative to the linear model, and
it appears the model would also generalize well to new data. Great!

In the third panel (bottom left) of Figure 9.5, we use a model with too many 
parameters—more parameters than we have data points. With this approach we reduce 
the cost associated with our training data points to nil: There is no perceptible gap 
between the curve and the data. In the last panel (bottom right), however, we show 
new data points from the original distribution in green, which were unseen by the model 
during training and so can be used to validate the model. Despite eliminating training 
cost entirely, the model fits these validation data poorly and so it results in a significant 
validation cost. The many-parameter model, dear friends, is overfit: It is a perfect model 
for the training data, but it doesn’t actually capture the relationship between x and y well; 
rather, it has learned the exact features of the training data too closely, and it subsequently 
performs poorly on unseen data. 
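
The four panels described above can be approximated with a small NumPy sketch. Everything specific here is an assumption made for illustration rather than a detail of Figure 9.5: the underlying relationship y = x^2, the noise level, the ten training points, and the use of `np.polyfit`, whose degree-1 fit has a slope and an intercept. The pattern to look for is training cost falling as the polynomial degree grows, while the validation cost of the curve that passes through every training point is far above its own training cost.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable

def noisy_targets(x):
    """Assumed underlying relationship y = x^2 plus Gaussian noise."""
    return x ** 2 + rng.normal(0.0, 0.1, size=x.shape)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Ten training points, plus fifty unseen validation points from the same distribution
x_train = np.linspace(-1.0, 1.0, 10)
y_train = noisy_targets(x_train)
x_valid = rng.uniform(-1.0, 1.0, size=50)
y_valid = noisy_targets(x_valid)

costs = {}
for degree in (1, 2, 9):  # straight line, parabola, curve through every training point
    coeffs = np.polyfit(x_train, y_train, degree)
    costs[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_valid, y_valid))
    print(f"degree {degree}: training cost {costs[degree][0]:.4f}, "
          f"validation cost {costs[degree][1]:.4f}")
```

The degree-9 polynomial has as many coefficients as there are training points, so its training cost is essentially zero, yet it still misses the unseen validation points.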

Consider how in three lines of code in Example 5.6, we created a shallow neural network
architecture with more than 50,000 parameters (Figure 7.5). Given this, it should 
not be surprising that deep learning architectures regularly have millions of parameters. 15 
Working with models that have such a large number of parameters but with perhaps only 
thousands of training samples could be a recipe for severe overfitting. 16 Because we yearn 
to capitalize on deep, sophisticated network architectures even if we don’t have oodles 
of data at hand, thankfully we can rely on techniques specifically designed to reduce 
overfitting. In this section, we cover three of the best-known such techniques: L1/L2 
regularization, dropout, and data augmentation. 


13. This models a linear relationship, the simplest form of regression between two variables. 

14. Recall the quadratic function from high school algebra. 

15. Indeed, as early as Chapter 10, you’ll encounter models with tens of millions of parameters. 

16. This circumstance can be annotated as n ≪ p, indicating the number of samples is much smaller than the 
parameter count. 

Figure 9.5 Fitting y given x using models with varying numbers of parameters. Top left: 
A single-parameter model underfits the data. Top right: A two-parameter model fits a 
parabola that suits the relationship between x and y well. Bottom left: A many-parameter 
model overfits the data, generalizing poorly to new data points (shown in green in the 
bottom-right panel). 

L1 and L2 Regularization 

In branches of machine learning other than deep learning, the use of L1 or L2 regularization
to reduce overfitting is prevalent. These techniques, which are alternately known 
as LASSO 17 regression and ridge regression, respectively, both penalize models for
including parameters by adding the parameters to the model’s cost function. The larger a 
given parameter’s size, the more that parameter adds to the cost function. Because of this, 
parameters are not retained by the model unless they appreciably contribute to the
reduction of the difference between the model’s estimated ŷ and the true y. In other words, 
extraneous parameters are pared away. 


17. Least absolute shrinkage and selection operator 


The distinction between L1 and L2 regularization is that L1’s additions to cost correspond
to the absolute value of parameter sizes, whereas L2’s additions correspond to 
the square of these. The net effect of this is that L1 regularization tends to lead to the 
inclusion of a smaller number of larger-sized parameters in the model, while L2
regularization tends to lead to the inclusion of a larger number of smaller-sized parameters. 
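
As a minimal sketch of the difference, the two penalties can be written as plain Python functions. The regularization strength `lam`, the parameter values, and the unregularized cost below are all made-up numbers chosen for illustration; the text does not specify them.

```python
def l1_penalty(weights, lam):
    """L1 (LASSO-style) penalty: lam times the sum of absolute parameter values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (ridge-style) penalty: lam times the sum of squared parameter values."""
    return lam * sum(w ** 2 for w in weights)

weights = [3.0, -0.5, 0.0, 0.1]  # made-up parameter values
base_cost = 1.25                 # made-up unregularized cost
lam = 0.01                       # assumed regularization strength

cost_l1 = base_cost + l1_penalty(weights, lam)
cost_l2 = base_cost + l2_penalty(weights, lam)
print(cost_l1, cost_l2)
```

Because the L2 penalty squares each value, the single large parameter (3.0) dominates it, whereas under L1 every parameter contributes in proportion to its absolute size.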


Dropout 

L1 and L2 regularization work fine to reduce overfitting in deep learning models, but 
deep learning practitioners tend to favor the use of a neural-network-specific regularization
technique instead. This technique, called dropout, was developed by Geoff Hinton 
(Figure 1.16) and his colleagues at the University of Toronto 18 and was made famous by 
its incorporation in their benchmark-smashing AlexNet architecture (Figure 1.17). 

Hinton and his coworkers’ intuitive yet powerful concept for preventing overfitting 
is captured by Figure 9.6. In a nutshell, dropout simply pretends that a randomly selected 
proportion of the neurons in each layer don’t exist during each round of training. To
illustrate this, three rounds of training 19 are shown in the figure. For each round, we remove 
a specified proportion of the neurons in each hidden layer by random selection. For the first hidden layer of 
the network, we’ve configured it to drop out one-third (33.3 percent) of the neurons. For 
the second hidden layer, we’ve configured 50 percent of the neurons to be dropped out. 
Let’s cover the three training rounds shown in Figure 9.6: 

1. In the top panel, the second neuron of the first hidden layer and the first neuron of 
the second hidden layer are randomly dropped out. 

2. In the middle panel, it is the first neuron of the first hidden layer and the second 
one of the second hidden layer that are selected for dropout. There is no “memory” 
of which neurons have been dropped out on previous training rounds, and so it is 
by chance alone that the neurons dropped out in the second round are distinct from 
those dropped out in the first. 

3. In the bottom panel, the third neuron of the first hidden layer is dropped out for 
the first time. For the second consecutive round of training, the second neuron of 
the second hidden layer is also randomly selected. 

Instead of reining in parameter sizes toward zero (as with batch normalization), 
dropout doesn’t (directly) constrain how large a given parameter value can become. 
Dropout is nevertheless an effective regularization technique, because it prevents any 
single neuron from becoming excessively influential within the network: Dropout makes 
it challenging for some very specific aspect of the training dataset to create an overly
specific forward-propagation pathway through the network because, on any given round of 
training, neurons along that pathway could be removed. In this way, the model doesn’t 
become overreliant on certain features of the data to generate a good prediction. 


18. Hinton, G., et al. (2012). Improving neural networks by preventing co-adaptation of feature detectors. 
arXiv:1207.0580. 

19. If the phrase round of training is not immediately familiar, refer to Figure 8.5 for a refresher. 

Figure 9.6 Dropout, a technique for reducing model overfitting, involves the removal of 
randomly selected neurons from a network’s hidden layers in each round of training. 
Three rounds of training with dropout are shown here. 

When validating a neural network model that was trained using dropout, or indeed 
when making real-world inferences with such a network, we must take an extra step first. 
During validation or inference, we would like to leverage the power of the full network, 
that is, its total complement of neurons. The snag is that, during training, we only ever 
used a subset of the neurons to forward propagate x through the network and estimate ŷ. 
If we were to naively carry out this forward propagation with suddenly all of the neurons, 
our ŷ would emerge befuddled: There are now too many parameters, and the totals after 
all the mathematical operations would be larger than expected. To compensate for the 
additional neurons, we must correspondingly adjust our neuron parameters downward. If 
we had, say, dropped out half of the neurons in a hidden layer during training, then we 
would need to multiply the layer’s parameters by 0.5 prior to validation or inference. As a 
second example, for a hidden layer in which we dropped out 33.3 percent of the neurons 
during training, we then must multiply the layer’s parameters by 0.667 prior to validation. 20 
Thankfully, Keras handles this parameter-adjustment process for us automatically. 
When working in other deep learning libraries (e.g., low-level TensorFlow), however, 
you may need to be mindful and remember to carry out these adjustments yourself. 
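
The adjustment described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how Keras implements dropout internally: the function name `dropout_forward` and the activation values are made up, and the sketch scales a layer's outputs at inference time, which has the same effect on the next layer as scaling that layer's outgoing parameters.

```python
import random

def dropout_forward(activations, drop_rate, training, rng=random):
    """Apply the dropout scheme described above to one layer's activations.

    Training: each neuron is zeroed with probability drop_rate.
    Inference: every neuron is kept, and its output is multiplied by the
    retention probability p = 1 - drop_rate to compensate for the extra neurons.
    """
    keep_prob = 1.0 - drop_rate
    if training:
        return [a if rng.random() < keep_prob else 0.0 for a in activations]
    return [a * keep_prob for a in activations]

layer_output = [0.8, 1.2, 0.5, 2.0]  # made-up activations from a hidden layer
train_out = dropout_forward(layer_output, 0.5, training=True, rng=random.Random(0))
infer_out = dropout_forward(layer_output, 0.5, training=False)
print(train_out)   # some entries zeroed at random
print(infer_out)   # every entry halved
```

Many libraries instead scale the retained activations up by 1/p during training ("inverted dropout"), which yields the same expected totals and leaves inference untouched.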


If you’re familiar with creating ensembles of statistical models (e.g., a single random 
forest out of multiple random decision trees), then it may already be evident to you 
that dropout produces such an ensemble. During each round of training, a random 
subnetwork is created, and its parameter values are tuned. Later, at the conclusion of 
training, all of these subnetworks are reflected in the parameter values throughout the 
final network. In this way, the final network is an aggregated ensemble of its constituent 
subnetworks. 

As with learning rate and mini-batch size (discussed in Chapter 8), network architecture
options pertaining to dropout are hyperparameters. Here are our rules of thumb for 
choosing which layers to apply dropout to and how much of it to apply: 

■ If your network is overfitting to your training data (i.e., your validation cost increases
while your training cost goes down), then dropout is warranted somewhere 
in the network. 

■ Even if your network isn’t obviously overfitting to your training data, adding some 
dropout to the network may improve validation accuracy—especially in later epochs 
of training. 

■ Applying dropout to all of the hidden layers in your network may be overkill. If 
your network has a fair bit of depth, it may be sufficient to apply dropout solely 
to later layers in the network (the earliest layers may be harmlessly identifying 
features). To test this out, you could begin by applying dropout only to the final 
hidden layer and observing whether this is sufficient for curtailing overfitting; if not, 
add dropout to the next deepest layer, test it, and so on. 

■ If your network is struggling to reduce validation cost or to recapitulate low validation
costs attained when less dropout was applied, then you’ve added too much 
dropout—pare it back! As with other hyperparameters, there is a Goldilocks zone 
for dropout, too. 

■ With respect to how much dropout to apply to a given layer, each network behaves 
uniquely and so some experimentation is required. In our experience, dropping 
out 20 percent up to 50 percent of the hidden-layer neurons in machine vision 
applications tends to provide the highest validation accuracies. In natural language 
applications, where individual words and phrases can convey particular significance, 
we have found that dropping out a smaller proportion (between 20 percent and 30 
percent of the neurons in a given hidden layer) tends to be optimal. 


20. Put another way, if the probability of a given neuron being retained during training is p, then we multiply that 
neuron’s parameters by p prior to carrying out model validation or inference. 

Data Augmentation 

In addition to regularizing your model’s parameters to reduce overfitting, another 
approach is to increase the size of your training dataset. If it is possible to inexpensively 
collect additional high-quality training data for the particular modeling problem you’re 
working on, then you should do so! The more data provided to a model during training, 
the better the model will be able to generalize to unseen validation data. 

In many cases, collecting fresh data is a pipe dream. It may nevertheless be possible to 
generate new training data from existing data by augmenting it, thereby artificially 
expanding your training dataset. With the MNIST digits, for example, many different 
types of transforms would yield training samples that constitute suitable handwritten 
digits, such as: 

■ Skewing the image 

■ Blurring the image 

■ Shifting the image a few pixels 

■ Applying random noise to the image 

■ Rotating the image slightly 

Indeed, as shown on the personal website of Yann LeCun (see Figure 1.9 for a portrait), 
many of the record-setting MNIST validation dataset classifiers took advantage of such 
artificial training dataset expansion. 21,22 
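
Two of the transforms listed above, shifting by a few pixels and applying random noise, can be sketched with NumPy. The function name `augment`, the shift range, the noise level, and the stand-in "digit" are all assumptions for illustration; this is not the Keras tooling the book uses in Chapter 10.

```python
import numpy as np

def augment(image, rng, max_shift=2, noise_std=0.05):
    """Return a randomly shifted, noised copy of a 2-D grayscale image in [0, 1]."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # np.roll wraps pixels around the edges; harmless here because the border is blank
    shifted = np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))
    noisy = shifted + rng.normal(0.0, noise_std, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)  # keep pixel values in the valid range

rng = np.random.default_rng(0)
digit = np.zeros((28, 28))
digit[6:22, 13:15] = 1.0  # crude stand-in for a handwritten "1"

augmented = [augment(digit, rng) for _ in range(5)]
print(len(augmented), augmented[0].shape)
```

Each call yields a slightly different image that should still read as the same digit, so all five copies can share the original label.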

Fancy Optimizers 

So far in this book we’ve used only one optimization algorithm: stochastic gradient 
descent. Although SGD performs well, researchers have devised shrewd ways to improve it. 

Momentum 

The first SGD improvement is to consider momentum. Here’s an analogy of the principle: 
Let’s imagine it’s winter and our intrepid trilobite is skiing down a snowy gradient-mountain.
If a local minimum is encountered (as in the middle panel of Figure 8.7), the 
momentum of the trilobite’s movement down the slippery hill will keep it moving, and 
the minimum will be easily bypassed. In this way, the gradients on previous steps have 
influenced the current step. 

We calculate momentum in SGD by taking a moving average of the gradients for each 
parameter and using that to update the weights in each step. When using momentum, 
we have the additional hyperparameter β (beta), which ranges from 0 to 1, and which 
controls how many previous gradients are incorporated in the moving average. Larger β 
values permit older gradients to contribute more to the moving average, something that can 
be unhelpful if taken too far; the trilobite wouldn’t want the steepest part of the hill to 
guide its speed as it approaches the lodge for its après-ski drinks. Typically we’d
nevertheless use fairly large β values, with β = 0.9 serving as a reasonable default. 


21. yann.lecun.com/exdb/mnist 

22. We will use Keras data-augmentation tools on actual images of hot dogs in Chapter 10. 
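
A minimal sketch of the moving-average update, in plain Python on a one-parameter toy problem. The cost function C(w) = (w - 3)^2, the learning rate, and the step count are assumptions chosen so the example is easy to follow; libraries such as Keras use a closely related but not identical formulation.

```python
def sgd_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=300):
    """SGD with momentum: step along a moving average of past gradients."""
    velocity = 0.0
    for _ in range(steps):
        g = grad_fn(w)
        velocity = beta * velocity + (1.0 - beta) * g  # moving average of gradients
        w = w - lr * velocity                          # update uses the average
    return w

# Toy cost C(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)  # ends close to the minimum at w = 3
```

Setting `beta=0.0` in this sketch recovers ordinary gradient descent, because the moving average then contains only the current gradient.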

Nesterov Momentum 

Another version of momentum is called Nesterov momentum. In this approach, the moving
average of the gradients is first used to update the weights and find the gradients at 
whatever that position may be; this is equivalent to a quick peek at the position where 
momentum might take us. We then use the gradients from this sneak-peek position to 
execute a gradient step from our original position. In other words, our trilobite is suddenly 
aware of its speed down the hill, so it’s taking that into account, guessing where its own 
momentum might be taking it, and then adjusting its course before it even gets there. 
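
The sneak-peek idea can be sketched as a small change to a momentum loop. This is one of several equivalent-in-spirit ways to write Nesterov momentum, shown under the same toy assumptions as before (a one-parameter cost C(w) = (w - 3)^2 and made-up hyperparameter values); it is not taken from any particular library.

```python
def sgd_nesterov(grad_fn, w, lr=0.1, beta=0.9, steps=300):
    """Nesterov momentum: evaluate the gradient at a look-ahead position."""
    velocity = 0.0
    for _ in range(steps):
        lookahead = w - lr * beta * velocity   # sneak peek at where momentum leads
        g = grad_fn(lookahead)                 # gradient at the peeked position
        velocity = beta * velocity + (1.0 - beta) * g
        w = w - lr * velocity                  # step is taken from the original position
    return w

# Toy cost C(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = sgd_nesterov(lambda w: 2.0 * (w - 3.0), w=0.0)
print(w_final)  # ends close to the minimum at w = 3
```

The only difference from plain momentum is where the gradient is measured: at the look-ahead position rather than at the current weights.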

AdaGrad 

Although both momentum approaches improve SGD, a shortcoming is that they both 
use a single learning rate η for all parameters. Imagine, if you will, that we could have 
an individual learning rate for each parameter, thus enabling those parameters that have 
already reached their optimum to slow or halt learning, while those that are far from their 
optima can keep going. Well, you’re in luck! That’s exactly what can be achieved with the 
other optimizers we’ll discuss in this section: AdaGrad, AdaDelta, RMSProp, and Adam. 

The name AdaGrad comes from “adaptive gradient.” 23 In this variation, every parameter
has a unique learning rate that scales depending on the importance of that feature. 
This is especially useful for sparse data where some features occur only rarely: When those 
features do occur, we’d like to make larger updates of their parameters. We achieve this 
individualization by maintaining a matrix of the sum of squares of the past gradients for 
each parameter, and dividing the learning rate by its square root. AdaGrad is the first 
introduction to the parameter ε (epsilon), which is a doozy: Epsilon is a smoothing factor 
to avoid divide-by-zero errors and can safely be left at its default value of ε = 1 × 10⁻⁸. 24 

A significant benefit of AdaGrad is that it minimizes the need to tinker with the learning
rate hyperparameter η. You can generally just set-it-and-forget-it at its default of 
η = 0.01. A considerable downside of AdaGrad is that, as the values in the matrix of past 
squared gradients grow, the learning rate is divided by a larger and larger value, 
which eventually renders the learning rate impractically small and so learning essentially 
stops. 
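
The shrinking learning rate is easy to see in a one-parameter sketch. The toy cost C(w) = (w - 3)^2 and the placement of ε inside the square root are assumptions for illustration; implementations differ on such details.

```python
import math

def adagrad_step(w, g, grad_sq_sum, lr=0.01, eps=1e-8):
    """One AdaGrad update for a single parameter."""
    grad_sq_sum = grad_sq_sum + g ** 2             # running sum of squared gradients
    w = w - lr / math.sqrt(grad_sq_sum + eps) * g  # learning rate divided by its root
    return w, grad_sq_sum

w, grad_sq_sum = 0.0, 0.0
weights, effective_rates = [], []
for _ in range(5):
    g = 2.0 * (w - 3.0)  # gradient of the toy cost C(w) = (w - 3)^2
    w, grad_sq_sum = adagrad_step(w, g, grad_sq_sum)
    weights.append(w)
    effective_rates.append(0.01 / math.sqrt(grad_sq_sum + 1e-8))

print(effective_rates)  # each entry is smaller than the one before
```

Because the sum of squares can only grow, the effective learning rate can only fall, which is the downside described above.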

AdaDelta and RMSProp 

AdaDelta resolves the gradient-matrix-size shortcoming of AdaGrad by maintaining 
a moving average of previous gradients in the same manner that momentum does. 25 
AdaDelta also eliminates the η term, so a learning rate doesn’t need to be configured 
at all. 26 


23. Duchi, J., et al. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal 
of Machine Learning Research, 12, 2121–59. 

24. AdaGrad, AdaDelta, RMSProp, and Adam all use ε for the same purpose, and it can be left at its default across 
all of these methods. 

25. Zeiler, M.D. (2012). ADADELTA: An adaptive learning rate method. arXiv:1212.5701. 

RMSProp (root mean square propagation) was developed by Geoff Hinton (see 
Figure 1.16 for a portrait) at about the same time as AdaDelta. 27 It works similarly except 
it retains the learning rate η parameter. Both RMSProp and AdaDelta involve an extra 
hyperparameter ρ (rho), or decay rate, which is analogous to the β value from momentum
and which guides the size of the window for the moving average. Recommended 
values for the hyperparameters are ρ = 0.95 for both optimizers, and setting η = 0.001 
for RMSProp. 
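
A minimal sketch of an RMSProp-style update for one parameter, using the ρ and η values mentioned above. The constant gradient of 2.0 is a made-up input chosen to make the behavior easy to inspect, and the placement of ε is an assumption.

```python
import math

def rmsprop_step(w, g, avg_sq, lr=0.001, rho=0.95, eps=1e-8):
    """One RMSProp-style update for a single parameter."""
    avg_sq = rho * avg_sq + (1.0 - rho) * g ** 2  # moving average of squared gradients
    w = w - lr / math.sqrt(avg_sq + eps) * g
    return w, avg_sq

w, avg_sq = 0.0, 0.0
step_sizes = []
for _ in range(500):
    g = 2.0  # constant made-up gradient
    new_w, avg_sq = rmsprop_step(w, g, avg_sq)
    step_sizes.append(w - new_w)
    w = new_w

print(step_sizes[0], step_sizes[-1])  # the step size settles near lr instead of decaying to zero
```

Because old squared gradients decay out of the moving average, the divisor levels off rather than growing without bound as it does in AdaGrad.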

Adam 

The final optimizer we discuss in this section is also the one we employ most often in the 
book. Adam (short for adaptive moment estimation) builds on the optimizers that came 
before it. 28 It’s essentially the RMSProp algorithm with two exceptions: 

1. An extra moving average is calculated, this time of past gradients for each parameter 
(called the average first moment of the gradient, 29 or simply the mean) and this is 
used to inform the update instead of the actual gradients at that point. 

2. A clever bias trick is used to help prevent these moving averages from skewing 
toward zero at the start of training. 

Adam has two β hyperparameters, one for each of the moving averages that are calculated.
Recommended defaults are β₁ = 0.9 and β₂ = 0.999. The learning rate default 
with Adam is η = 0.001, and you can generally leave it there. 
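
The two exceptions can be sketched for a single parameter as follows, using the default values quoted above. The toy cost C(w) = (w - 3)^2 and the function name `adam_step` are assumptions for illustration; the update follows the form given in the Adam paper (footnote 28).

```python
import math

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g        # moving average of gradients (first moment)
    v = beta2 * v + (1.0 - beta2) * g ** 2   # moving average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)           # bias correction: both averages start at
    v_hat = v / (1.0 - beta2 ** t)           # zero, so early values are scaled up
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 0.0, 0.0, 0.0
trajectory = []
for t in range(1, 101):  # t starts at 1 so the bias correction is defined
    g = 2.0 * (w - 3.0)  # gradient of the toy cost C(w) = (w - 3)^2
    w, m, v = adam_step(w, g, m, v, t)
    trajectory.append(w)

print(trajectory[0], trajectory[-1])
```

Thanks to the bias correction, the very first step has a magnitude of roughly the learning rate, even though both moving averages were initialized at zero.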

Because RMSProp, AdaDelta, and Adam are so similar, they can be used interchangeably
in similar applications, although the bias correction may help Adam later in training. 
Even though these newfangled optimizers are in vogue, there is still a strong case for 
simple SGD with momentum (or Nesterov momentum), which in some cases performs 
better. As with other aspects of deep learning models, you can experiment with optimizers
and observe what works best for your particular model architecture and problem. 


A Deep Neural Network in Keras 

We can now sound the trumpet, because we’ve reached a momentous milestone! With the 
additional theory we’ve covered in this chapter, you have enough knowledge under your 
belt to competently design and train a deep learning model. If you’d like to follow along 
interactively as we do so, pop into the accompanying Deep Net in Keras Jupyter notebook. 
Relative to our shallow and intermediate-depth model notebooks (refer to Example 5.1), 
we have a pair of additional dependencies (namely, dropout and batch normalization) as 
provided in Example 9.4. 


26. This is achieved through a crafty mathematical trick that we don’t think is worth expounding on here. You 
may notice, however, that Keras and TensorFlow still have a learning rate parameter in their implementations of 
AdaDelta. In those cases, it is recommended to leave η at 1, that is, no scaling and therefore no functional learning 
rate as you have come to know it in this book. 

27. This optimizer remains unpublished. It was first proposed in Lecture 6e of Hinton’s Coursera course “Neural 
Networks for Machine Learning” (www.cs.toronto.edu/~hinton/coursera/lecture6/lec6.pdf). 

28. Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980. 

29. The other moving average is of the squares of the gradient, which is the second moment of the gradient, or 
the variance. 

Example 9.4 Additional dependencies for deep net in Keras 

from keras.layers import Dropout 
from keras.layers.normalization import BatchNormalization 

We load and preprocess the MNIST data in the same way as previously. As shown in 
Example 9.5, it is the neural network architecture cell where we begin to diverge. 

Example 9.5 Deep net in Keras model architecture 

model = Sequential() 

# first hidden layer, followed by batch normalization
model.add(Dense(64, activation='relu', input_shape=(784,))) 
model.add(BatchNormalization()) 

# second hidden layer
model.add(Dense(64, activation='relu')) 
model.add(BatchNormalization()) 

# third hidden layer, with dropout applied
model.add(Dense(64, activation='relu')) 
model.add(BatchNormalization()) 
model.add(Dropout(0.2)) 

# output layer
model.add(Dense(10, activation='softmax')) 

As before, we instantiate a Sequential model object. After we add our first hidden 
layer to it, however, we also add a BatchNormalization() layer. In doing this we are 
not adding an actual layer replete with neurons, but rather we’re adding the batch-norm 
transformation for the activations a from the layer before (the first hidden layer). As with 
the first hidden layer, we also add a BatchNormalization() layer atop the second hidden
layer of neurons. Our output layer is identical to the one used in the shallow and 
intermediate-depth nets, but to create an honest-to-goodness deep neural network, we 
are further adding a third hidden layer of neurons. As with the preceding hidden layers, 
the third hidden layer consists of 64 batch-normalized relu neurons. We are, however, 
supplementing this final hidden layer with Dropout, set to remove one-fifth (0.2) of the 
layer’s neurons during each round of training. 

As captured in Example 9.6, the only other change relative to our intermediate-depth 
network is that we use the Adam optimizer (optimizer='adam') in place of ordinary 
SGD optimization. 
Example 9.6 Deep net in Keras model compilation 

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy']) 

Note that we need not supply any hyperparameters to the Adam optimizer, because 
Keras automatically includes all the sensible defaults we detailed in the preceding section. 
For all of the other optimizers we covered, Keras (and TensorFlow, for that matter) has 
implementations that can easily be dropped in in place of ordinary SGD or Adam. You 
can refer to the documentation for those libraries online to see exactly how it’s done. 

When we call the fit() method on our model, 30 we discover that our digestion of all 
the additional theory in this chapter paid off: With our intermediate-depth network, our 
validation accuracy plateaued around 97.6 percent, but our deep net attained 97.87 percent
validation accuracy following 15 epochs of training (see Figure 9.7), shaving away 11 
percent of our already-small error rate. To squeeze even more juice out of the error-rate 
lemon than that, we’re going to need machine-vision-specific neuron layers such as those 
introduced in the upcoming Chapter 10. 
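
The "11 percent" figure is a relative reduction in error rate, not a change in accuracy points. A short calculation from the two accuracies quoted above makes the arithmetic explicit (the variable names are ours):

```python
intermediate_acc = 0.976  # validation accuracy of the intermediate-depth network
deep_acc = 0.9787         # validation accuracy of the deep network

error_before = 1.0 - intermediate_acc  # about 2.4 percent of digits misclassified
error_after = 1.0 - deep_acc           # about 2.13 percent misclassified

relative_reduction = (error_before - error_after) / error_before
print(f"error rate cut by roughly {relative_reduction:.0%}")
```

In absolute terms the improvement is only 0.27 percentage points, which is why it is more informative to express it relative to the small error rate that remained.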

Regression 

In Chapter 4, when discussing supervised learning problems, we mentioned that these can 
involve either classification or regression. In this book, nearly all our models are used for 
classifying inputs into one category or another. In this section, however, we depart from 
that tendency and highlight how to adapt neural network models to regression tasks,
that is, any problem where you’d like to predict some continuous variable. Examples of 
regression problems include predicting the future price of a stock, forecasting how many 
centimeters of rain may fall tomorrow, and modeling how many sales to expect of a
particular product. In this section, we use a neural network and a classic dataset to estimate 
the price of housing near Boston, Massachusetts, in the 1970s. 

Our dependencies, as shown in our Regression in Keras notebook, are provided in 
Example 9.7. The only unfamiliar dependency is the boston_housing dataset, which is 
conveniently bundled into the Keras library. 

[Figure 9.7: Keras training output for epochs 15 and 16 of 20; validation accuracy (val_acc) is 0.9787 after epoch 15 and 0.9767 after epoch 16.] 

Figure 9.7 Our deep neural network architecture peaked at a 97.87 percent validation 
accuracy following 15 epochs of training, besting the accuracy of our shallow and 
intermediate-depth architectures. Because of the randomness of network initialization 
and training, you may obtain a slightly lower or a slightly higher accuracy with the 
identical architecture. 


30. This model.fit() step is exactly the same as for our Intermediate Net in Keras notebook, that is, Example 8.3. 
Example 9.7 Regression model dependencies 

from keras.datasets import boston_housing 
from keras.models import Sequential 
from keras.layers import Dense, Dropout 
from keras.layers.normalization import BatchNormalization 

Loading the data is as simple as with the MNIST digits: 

(X_train, y_train), (X_valid, y_valid) = boston_housing.load_data() 

Checking the shape attribute of X_train and X_valid, we find that there are 404 training 
cases and 102 validation cases. For each case (a distinct area of the Boston suburbs) we 
have 13 predictor variables related to building age, mean number of rooms, crime rate, 
the local student-to-teacher ratio, and so on. 31 The median house price (in thousands of 
dollars) for each area is provided in the y variables. As an example, the first case in the 
training set has a median house price of $15,200. 32 

The network architecture we built for house-price prediction is provided in 
Example 9.8. 

Example 9.8 Regression model network architecture 

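model = Sequential() 

# first hidden layer: 13 input variables feed 32 neurons
model.add(Dense(32, input_dim=13, activation='relu')) 
model.add(BatchNormalization()) 

# second hidden layer, with a touch of dropout
model.add(Dense(16, activation='relu')) 
model.add(BatchNormalization()) 
model.add(Dropout(0.2)) 

# single linear output neuron for the continuous prediction
model.add(Dense(1, activation='linear')) 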

Reasoning that with only 13 input values and a few hundred training cases we would 
gain little from a deep neural network with oodles of neurons in each layer, we opted 
for a two-hidden-layer architecture consisting of merely 32 and 16 neurons per layer. We 
applied batch normalization and a touch of dropout to avoid overfitting to the particular 
cases of the training dataset. Most critically, in the output layer we set the activation 
argument to linear, the option to go with when you’d like to predict a continuous 
variable, as we do when performing regression. The linear activation function outputs 
z directly so that the network’s ŷ can be any numeric value (representing, e.g., dollars, 
centimeters) instead of being squashed into a probability between 0 and 1 (as happens 
when you use the sigmoid or softmax activation functions). 


31. You can read more about the data by referring to the article they were originally published in: Harrison, D., 
& Rubinfeld, D. L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and 
Management, 5, 81–102. 

32. Running y_train[0] returns 15.2. 

When compiling the model (see Example 9.9), another regression-specific 
adjustment we made is using mean squared error (MSE) in place of cross-entropy 
(loss='mean_squared_error'). While we’ve used cross-entropy cost exclusively so 
far in this book, that cost function is specifically designed for classification problems, in 
which ŷ is a probability. For regression problems, where the output is inherently not a 
probability, we use MSE instead. 33 
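
For reference, MSE is simply the average of the squared differences between predictions and true values. The sketch below computes it by hand, alongside the mean absolute error (MAE) mentioned in footnote 33, on three made-up house prices in thousands of dollars; none of these numbers come from the model in this chapter.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; large misses are penalized heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [15.2, 14.1, 30.0]  # made-up median prices, in thousands of dollars
y_pred = [16.2, 20.1, 28.0]  # made-up model estimates

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(mse, mae)
```

The errors here are 1, 6, and 2, so the single 6-unit miss contributes 36 of the 41 total squared-error units, which illustrates how MSE emphasizes large errors.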

Example 9.9 Compiling a regression model 

model.compile(loss='mean_squared_error', optimizer='adam') 

You may have noticed that we left out the accuracy metric when compiling this time 
around. This is deliberate: There’s no point in calculating accuracy, because this metric 
(the percentage of cases classified correctly) isn’t relevant to continuous variables as it is 
with categorical ones. 34 

Fitting our model (as in Example 9.10) is one step that is no different from classification. 

Example 9.10 Fitting a regression model 

model.fit(X_train, y_train, 
          batch_size=8, epochs=32, verbose=1, 
          validation_data=(X_valid, y_valid)) 

We trained for 32 epochs because, in our experience with this particular model, training 
for longer produced no lower validation losses. We didn’t spend any time optimizing 
the batch-size hyperparameter, so there could be small improvements in validation loss 
to be made by varying it. 

During our particular run of the regression model, our lowest validation loss (25.7) 
was attained in the 22nd epoch. By our final (32nd) epoch, this loss had risen
considerably to 56.5 (for comparison, we had a validation loss of 56.6 after just one epoch). 
In Chapter 11, we demonstrate how to save your model parameters after each epoch of 
training so that the best-performing epoch can be reloaded later, but for the time being 
we’re stuck with the relatively crummy parameters from the final epoch. In any event, if 
you’d like to see specific examples of model house-price inferences given some particular 
input data, you can do this by running the code provided in Example 9.11. 35 
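
Finding the best-performing epoch programmatically is straightforward once the per-epoch validation losses are in a list. In Keras, `fit()` returns a History object whose `history['val_loss']` entry holds such a list; the short list below is made up purely to exercise the helper and is not the output of the run described above.

```python
def best_epoch(val_losses):
    """Return (1-based epoch number, loss) for the lowest validation loss."""
    index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return index + 1, val_losses[index]

# Made-up validation losses for a six-epoch run, for illustration only
val_losses = [56.6, 40.2, 31.0, 25.7, 33.9, 56.5]

epoch, loss = best_epoch(val_losses)
print(f"lowest validation loss {loss} at epoch {epoch}")
```

Knowing which epoch was best does not by itself recover that epoch's parameters; the parameters must have been saved during training, which is the technique deferred to Chapter 11.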


33. There are other cost functions applicable to regression problems, such as mean absolute error (MAE) and 
Huber loss, although these aren’t covered in this book. MSE should serve you well enough. 

34. It’s also helpful to remember that, generally, accuracy is used only to set our minds at ease about how well our 
models are performing. The model itself learns from the cost, not the accuracy. 

35. Note that we had to use the NumPy reshape method to pass in the 13 predictor variables of the 43rd case as 
a row-oriented array of values ([1, 13]) as opposed to as a column. 

Example 9.11 Predicting the median house price in a particular suburb of Boston 

model.predict(np.reshape(X_valid[42], [1, 13])) 

This returned for us a predicted median house price (ŷ) of $20,880 for the 43rd Boston 
suburb in the validation dataset. The actual median price (y; which can be output by 
calling y_valid[42]) is $14,100. 
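
The reshape in Example 9.11 is needed because a single case pulled out of the validation array is one-dimensional, whereas the model expects a batch of rows. The sketch below shows the shapes involved, using a stand-in array of 13 made-up values in place of X_valid[42]:

```python
import numpy as np

case = np.arange(13, dtype=float)  # stand-in for X_valid[42]: 13 predictor values
print(case.shape)                  # a flat vector with 13 entries

row = np.reshape(case, [1, 13])    # one row (a batch of one case) with 13 columns
print(row.shape)

same_row = case[np.newaxis, :]     # an equivalent way to add the leading batch axis
print(same_row.shape)
```

The values themselves are unchanged; only the array's shape differs, so that the single case looks like a batch containing one row.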


TensorBoard 


When evaluating the performance of your model epoch over epoch, it can be tedious and 
time-consuming to read individual results numerically, as we did after running the code 
in Example 9.10, particularly if the model has been training for many epochs. Instead, 
TensorBoard (Figure 9.8) is a convenient, graphical tool for: 

■ Visually tracking model performance in real time 

■ Reviewing historical model performances 

■ Comparing the performance of various model architectures and hyperparameter 
settings applied to fitting the same data 

TensorBoard comes automatically with the TensorFlow library, and instructions for getting
it up and running are available via the TensorFlow site. 36 It’s generally straightforward 
to set up. Provided here, for example, is a procedure that adapts our Deep Net in Keras 
notebook for TensorBoard use on a Unix-based operating system, including macOS: 

1. As shown in Example 9.12, change your Python code as follows: 37 

a. Import the TensorBoard dependency from keras.callbacks. 

b. Instantiate a TensorBoard object (we’ll call it tensorboard), and specify a 
new, unique directory name (e.g., deep-net) that you’d like to create and 
have TensorBoard log data written into for this particular run of model-fitting: 

tensorboard = TensorBoard(log_dir='logs/deep-net') 

c. Pass the TensorBoard object as a callback parameter to the fit() method: 
callbacks = [tensorboard] 

2. In your terminal, run the following: 38 

tensorboard --logdir='logs/deep-net' --port 6006 

3. Navigate to localhost:6006 in your favorite web browser. 


36. tensorflow.org/guide/summaries_and_tensorboard

37. This is also laid out in our Deep Net in Keras with TensorBoard notebook.

38. Note: We specified the same logging directory location that the TensorBoard object was set to use in step 1b.
Since we specified a relative path and not an absolute path for our logging directory, we need to be mindful to run
the tensorboard command from the same directory as our Deep Net in Keras with TensorBoard notebook.

Figure 9.8 The TensorBoard dashboard enables you to, epoch over epoch, visually
track your model’s cost (loss) and accuracy (acc) across both your training data and
your validation (val) data.

Example 9.12 Using TensorBoard while fitting a model in Keras

from keras.callbacks import TensorBoard
tensorboard = TensorBoard('logs/deep-net')
model.fit(X_train, y_train,
          batch_size=128, epochs=20,
          verbose=1,
          validation_data=(X_valid, y_valid),
          callbacks=[tensorboard])

By following these steps or an analogous procedure for the circumstances of your
particular operating system, you should see something like Figure 9.8 in your browser
window. From there, you can visually track any given model’s cost and accuracy across
both your training and validation datasets in real time as these metrics change epoch over
epoch. This kind of performance tracking is one of the primary uses of TensorBoard,
although the dashboard interface also provides heaps of other functionality, such as visual
breakdowns of your neural network graph and the distribution of your model weights.
You can learn about these additional features by reading the TensorBoard docs and 
exploring the interface on your own. 
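
Because the procedure above calls for a new, unique log directory for each run of model fitting, one possible way to generate such names is sketched here. The helper name make_run_dir and the timestamp format are illustrative assumptions of ours, not part of Keras or TensorBoard; the resulting path would simply be passed as the log directory.

```python
import os
import time

def make_run_dir(base_dir="logs", run_name="deep-net"):
    """Return a log directory path that is unique to this run.

    Hypothetical helper: it appends a timestamp to the run name so that
    successive runs do not write into the same TensorBoard directory.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    return os.path.join(base_dir, run_name + "-" + stamp)

run_dir = make_run_dir()
print(run_dir)  # a path such as logs/deep-net-<timestamp>

# With Keras installed, the path would be used as in Example 9.12:
# tensorboard = TensorBoard(log_dir=run_dir)
```

Keeping every run in its own subdirectory of logs is what makes it possible to compare historical runs side by side in the dashboard.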

Summary 

Over the course of the chapter, we discussed common pitfalls in modeling with neural 
networks and covered strategies for minimizing their impact on model performance. 

We wrapped up the chapter by applying all of the theory learned thus far in the book to 
construct our first bona fide deep learning network, which provided us with our best-yet
accuracy on MNIST handwritten-digit classification. While such deep, dense neural nets
are applicable to generally approximating any given output y when provided some input
x, they may not be the most efficient option for specialized modeling. Coming up next
in Part III, we introduce neural network layers and deep learning approaches that excel 
at particular, specialized tasks, including machine vision, natural language processing, the 
generation of art, and playing games. 
Key Concepts


Here are the essential foundational concepts thus far. New terms from the current
chapter are highlighted in purple.

■ parameters:

  ■ weight w

  ■ bias b

■ activation a

■ artificial neurons:

  ■ sigmoid

  ■ tanh

  ■ ReLU

  ■ linear

■ input layer

■ hidden layer

■ output layer

■ layer types:

  ■ dense (fully connected)

  ■ softmax

■ cost (loss) functions:

  ■ quadratic (mean squared error)

  ■ cross-entropy

■ forward propagation

■ backpropagation

■ unstable (especially vanishing) gradients

■ Glorot weight initialization

■ batch normalization

■ dropout

■ optimizers:

  ■ stochastic gradient descent

  ■ Adam

■ optimizer hyperparameters:

  ■ learning rate η

  ■ batch size
Machine Vision


Welcome to Part III, dear reader. Previously, we provided a high-level overview of
particular applications of deep learning (Part I). With the foundational, low-level theory
we’ve covered since (in Part II), you’re now well positioned to work through specialized
content across a range of application areas, primarily via hands-on example code. In this
chapter, for example, you’ll discover convolutional neural networks and apply them to
machine vision tasks. In the remainder of Part III, we cover practical examples of:

■ Recurrent neural networks for natural language processing in Chapter 11 

■ Generative adversarial networks for visual creativity in Chapter 12 

■ Deep reinforcement learning for sequential decision making within complex,
changing environments in Chapter 13

Convolutional Neural Networks 


A convolutional neural network —also known as a ConvNet or a CNN—is an artificial neural 
network that features one or more convolutional layers (also called conv layers). This layer 
type enables a deep learning model to efficiently process spatial patterns. As you’ll see 
firsthand in this chapter, this property makes convolutional layers especially effective in
computer vision applications. 

The Two-Dimensional Structure of Visual Imagery 

In our previous code examples involving handwritten MNIST digits, we converted the 
image data into one-dimensional arrays of numbers so that we could feed them into a 
dense hidden layer. More specifically, we began with 28x28-pixel grayscale images 
and converted them into 784-element one-dimensional arrays. 1 Although this step was 
necessary in the context of a dense, fully connected network—we needed to flatten the 
784 pixel values so that each one could be fed into a neuron of the first hidden layer—the 
collapse of a two-dimensional image into one dimension corresponds to a substantial loss 
of meaningful visual image structure. When you draw a digit with a pen on paper, you
don’t conceptualize it as a continuous linear sequence of pixels running from top-left to
bottom-right. If, for example, we printed an MNIST digit for you here as a 784-pixel-long
stream in shades of gray, we’d be willing to wager that you couldn’t identify the
digit. Instead, humans perceive visual information in a two-dimensional form, 2 and our
ability to recognize what we’re looking at is inherently tied to the spatial relationships
between the shapes and colors we perceive.


1. Recall that the pixel values were divided by 255 in order to scale everything to [0 : 1].
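
To make the loss of structure concrete, the short sketch below flattens a 28x28 grid in row-major order (an assumption on our part, though it is the default order of NumPy’s reshape) and reports where neighboring pixels end up. The helper name flat_index is hypothetical.

```python
WIDTH = 28  # MNIST images are 28x28 pixels

def flat_index(row, col, width=WIDTH):
    """Position of pixel (row, col) after row-major flattening."""
    return row * width + col

center = flat_index(10, 10)
right_neighbor = flat_index(10, 11)
below_neighbor = flat_index(11, 10)

# Horizontal neighbors stay adjacent in the flattened array...
print(right_neighbor - center)  # 1
# ...but vertical neighbors land a full row apart.
print(below_neighbor - center)  # 28
```

Two pixels that touch in the image can therefore sit 28 positions apart in the 784-element array, and a dense layer has no built-in knowledge that they were ever neighbors.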

Computational Complexity 

In addition to the loss of two-dimensional structure when we collapse an image, a second 
consideration when piping images into a dense network is computational complexity. 

The MNIST images are very small—28x28 pixels with only one channel (there is only 
one color “channel” because MNIST digits are monochromatic; to render images in full 
color, in contrast, at least three channels—usually red, green, and blue—are required). 
Passing MNIST image information into a dense layer corresponds to 785 parameters
per neuron: 784 weights, one for each of the pixels, plus the neuron’s bias. If we were handling
a moderately sized image, however—say, a 200x200-pixel, full-color RGB 3 image—
then the number of parameters increases dramatically. In that case, we’d have three color
channels, each with 40,000 pixels, corresponding to a total of 120,001 parameters per
neuron in a dense layer. 4 With a modest number of neurons in the dense layer—let’s say
64—that corresponds to nearly 8 million parameters associated with the first hidden layer
of our network alone. 5 Furthermore, the image is only 200x200 pixels—that’s barely
0.04MP, 6 whereas most modern smartphones have 12MP or greater camera sensors. Generally,
machine vision tasks don’t need to run on high-resolution images in order to
be successful, but the point should be clear: Images can contain a very large number of
data points, and using these in a naive, fully connected manner will explode the neural
network’s compute power requirements.
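
The parameter counts quoted above can be reproduced with a few lines of arithmetic. This is only a sketch; the function name dense_params_per_neuron is our own.

```python
def dense_params_per_neuron(height, width, channels):
    """One weight per input value, plus a single bias."""
    return height * width * channels + 1

mnist = dense_params_per_neuron(28, 28, 1)
rgb = dense_params_per_neuron(200, 200, 3)
layer_total = 64 * rgb  # a dense layer with 64 neurons

print(mnist)        # 785
print(rgb)          # 120001
print(layer_total)  # 7680064
```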

Convolutional Layers 

Convolutional layers consist of sets of kernels, which are also known as filters. Each of these 
kernels is a small window (called a patch) that scans across the image (in more technical 
terms, the filter convolves), from top left to bottom right (see Figure 10.1 for an illustration
of this convolutional operation). 

Kernels are made up of weights, which—as in dense layers—are learned through backpropagation.
Kernels can range in size, but a typical size is 3x3, and we use that in the
examples in this chapter. 7 For the monochromatic MNIST digits, this 3x3-pixel window
would consist of 3 x 3 x 1 weights—nine weights, for a total of 10 parameters (like an
artificial neuron in a dense layer, every convolutional filter has a bias term b). For
comparison, if we happened to be working with full-color RGB images, then a kernel
covering the same number of pixels would have three times as many weights—3 x 3 x 3
of them, for a total of 27 weights and 28 parameters.


2. Well . . . three-dimensional, but let’s ignore depth for the purposes of this discussion.

3. The red, green, and blue channels required for a full-color image.

4. 200 pixels x 200 pixels x 3 color channels + 1 bias = 120,001 parameters.

5. 64 neurons x 120,001 parameters per neuron = 7,680,064 parameters.

6. Megapixels.

7. Another typical size is 5x5, with kernels larger than that used infrequently.

Figure 10.1 When reading a page of a book written in English, we begin in the top-left
corner and read to the right. Every time we reach the end of a row of text, we progress to
the next row. In this way, we eventually reach the bottom-right corner, thereby reading all
of the words on the page. Analogously, the kernel in a convolutional layer begins on a
small window of pixels in the top-left corner of a given image. From the top row
downward, the kernel scans from left to right, until it eventually reaches the bottom-right
corner, thereby scanning all of the pixels in the image.


As depicted in Figure 10.1, the kernel occupies discrete positions across an image as it
convolves. Sticking with the 3x3 kernel size for this explanation, during forward propagation
a multidimensional variation of the “most important equation in this book”—
w · x + b (introduced in Figure 6.7)—is calculated at each position that the kernel
occupies as it convolves over the image. Referring to the 3x3 window of pixels and the
3x3 kernel in Figure 10.2 as inputs x and weights w, respectively, we can demonstrate
the calculation of the weighted sum w · x in which products are calculated elementwise
based on the alignment of vertical and horizontal locations. It’s helpful to imagine the
kernel superimposed over the pixel values. The math is presented here:


$$
\begin{aligned}
w \cdot x &= 0.01 \times 0.53 + 0.09 \times 0.34 + 0.22 \times 0.06 \\
&\quad + (-1.36) \times 0.37 + 0.34 \times 0.82 + (-1.59) \times 0.01 \\
&\quad + 0.13 \times 0.62 + (-0.69) \times 0.91 + 1.02 \times 0.34 \\
&= -0.3917
\end{aligned}
\tag{10.1}
$$

Figure 10.2 A 3x3 kernel and a 3x3-pixel window


Next, using Equation 7.1, we add some bias term b (say, 0.20) to arrive at z:

$$
\begin{aligned}
z &= w \cdot x + b \\
&= -0.39 + b \\
&= -0.39 + 0.20 \\
&= -0.19
\end{aligned}
\tag{10.2}
$$


With z, we can at last calculate an activation value a by passing z through the activation 
function of our choice, say the tanh function or the ReLU function. 
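
The calculation for a single kernel position can be checked in a few lines of Python. In this sketch the weights w and inputs x are read off Equation 10.1 (first and second factor of each product, respectively) and the bias of 0.20 is the one used in Equation 10.2; the variable names are our own.

```python
import math

# Kernel weights w and pixel inputs x, read off Equation 10.1
w = [[0.01, 0.09, 0.22],
     [-1.36, 0.34, -1.59],
     [0.13, -0.69, 1.02]]
x = [[0.53, 0.34, 0.06],
     [0.37, 0.82, 0.01],
     [0.62, 0.91, 0.34]]
b = 0.20

# Elementwise products summed over the aligned positions
weighted_sum = sum(w[i][j] * x[i][j] for i in range(3) for j in range(3))
z = weighted_sum + b

a_relu = max(0.0, z)   # ReLU activation
a_tanh = math.tanh(z)  # tanh activation

print(round(weighted_sum, 4))  # -0.3917
print(round(z, 4))             # -0.1917
print(a_relu)                  # 0.0
```

The text rounds the weighted sum to -0.39 before adding the bias, which is why it reports z as -0.19 rather than -0.1917.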

Note that the fundamental operation hasn’t changed relative to the artificial neuron 
mathematics of Chapters 6 and 7. Convolutional kernels have weights, inputs, and a bias; 
a weighted sum of these is produced using our most important equation; and the resulting 
z is passed through some nonlinear function to produce an activation. What has changed 
is that there isn’t a weight for every input, but rather a discrete kernel with 3x3 weights. 
These weights do not change as the kernel convolves; instead they’re shared across all of
the inputs. In this way, a convolutional layer can have orders of magnitude fewer weights 
than a fully connected layer. Another important point is that, like the inputs, the outputs 
from this kernel (all of the activations) are also arranged in a two-dimensional array. We’ll 
delve more into this in a moment, but first . . . 

Multiple Filters 

Typically, we have multiple filters in a given convolutional layer. Each filter enables 
the network to learn a representation of the data at a given layer in a unique way. For 
example, analogous to Hubel and Wiesel’s simple cells in the biological visual system
(Figure 1.5), if the first hidden layer in our network is a convolutional layer, it might contain
a kernel that responds optimally to vertical lines. Thus, whenever it convolves (slides
over) a vertical line in an input image, it produces a large activation (a) value. Additional
kernels in this layer can learn to represent other simple spatial features such as horizontal
lines and color transitions (for examples, see the bottom-left panel of Figure 1.17). This
is how these kernels came to be known as filters; they scan over the image and filter out
the location of specific features, producing high activations when they come across the
pattern, shape, and/or color they are specially tuned to detect. One could say that they
function as highlighters, producing a two-dimensional array of activations that indicate
where that filter’s particular feature exists in the original image. For this reason, the output
from a kernel is referred to as an activation map. 
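
As a rough illustration of a filter acting as a highlighter, the sketch below slides two hand-picked kernels over a tiny image containing a single vertical line. The kernel values, the 5x5 image, and the function name convolve2d are all contrived assumptions for this example; in a real network the weights would be learned.

```python
def convolve2d(image, kernel, bias=0):
    """Slide the kernel over the image (stride 1, no padding) and apply ReLU."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    activation_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            z = bias + sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(k) for j in range(k))
            row.append(max(0, z))  # ReLU
        activation_map.append(row)
    return activation_map

# A 5x5 image with a vertical line running down the middle column
image = [[0, 0, 1, 0, 0] for _ in range(5)]

vertical_kernel = [[-1, 2, -1] for _ in range(3)]
horizontal_kernel = [[-1, -1, -1], [2, 2, 2], [-1, -1, -1]]

print(convolve2d(image, vertical_kernel))    # [[0, 6, 0], [0, 6, 0], [0, 6, 0]]
print(convolve2d(image, horizontal_kernel))  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

The vertical-line kernel produces large activations only where it is centered on the line, while the horizontal-line kernel stays silent over the whole image.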

Analogous to the hierarchical representations of the biological visual system 
(Figure 1.6), subsequent convolutional layers receive these activation maps as their inputs. 
As the network gets deeper, the filters in the layers react to increasingly complex combi- 
nations of these simple features, learning to represent increasingly abstract spatial patterns 
and eventually building a hierarchy from simple lines and colors up to complex textures 
and shapes (see the panels along the bottom of Figure 1.17). In this way, later layers 
within the network have the capacity to recognize whole objects or even, say, to distin- 
guish an image of a Great Dane from that of a Yorkshire Terrier. 

The number of filters in the layer, like the number of neurons in a dense layer, is a 
hyperparameter that we configure ourselves. As with the other hyperparameters covered 
already in this book, there is a Goldilocks sweet spot for filter number. Here are our rules 
of thumb for homing in on it for your particular problem: 

■ Having a larger number of kernels facilitates the identification of more-complex
features, so consider the complexity of the data and the problem you’re solving. Of
course, more kernels come at the cost of computational efficiency.

■ If a network has multiple convolutional layers, the optimal number of kernels for a
given layer could vary quite a bit from layer to layer. Keep in mind that early layers
identify simple features, whereas later layers identify complex recombinations
of these simple features, so let this guide where you stack your network. As we’ll
see when we get into coded examples of CNNs later in this chapter, a common
approach for machine vision is to have many more kernels in later convolutional
layers relative to early convolutional layers.

■ As always, strive to minimize computational complexity: Consider using the smallest
number of kernels that facilitates a low cost on your validation data. If doubling
the number of kernels (say, from 32 to 64 to 128) in a given layer significantly
decreases your model’s validation cost, then consider using the higher value. If halving
the number of kernels (say, from 32 to 16 to 8) in a given layer doesn’t increase
your model’s validation cost, then consider using the smaller value.

A Convolutional Example 

Convolutional layers are a nontrivial departure from the simpler fully connected layers of 
Part II, so, to help you make sense of the way the pixel values and weights combine to 
produce feature maps, across Figures 10.3 through 10.5 we’ve created a detailed contrived 
example with accompanying math. To begin, imagine we’re convolving over a single 
RGB image that’s 3x3 pixels in size. In Python, those data are stored in a [3,3,3] array
as shown at the top of Figure 10.3. 8 


8. We admit that the RGB example of the tree has far more than nine pixels, but we struggled to identify a 
compelling color image that was 3x3.

Figure 10.3 This schematic diagram demonstrates how the activation values in a
feature map are calculated in a convolutional layer.


Shown in the middle of the figure are the 3x3 arrays for each of the three channels:
red, green, and blue. Note that the image has been padded with zeros on all four sides.
We’ll discuss more about padding shortly, but for now all you need to know is this:
Padding is used to ensure that the resulting feature map has the same dimensions as the
input data. Below the arrays of pixel values you’ll find the weight matrices for each of
the channels. We chose a kernel size of 3x3, and given that there are three channels in
the input image the weights matrix will be an array with dimensions [3,3,3], shown
here individually. The bias term is 0.2. The current position of the filter is indicated by
an overlay on each array of pixel values, and the z-value (determined by calculating the
weighted sum from Equation 10.1 across all three color channels, and then adding the
bias as in Equation 10.2) is given at the bottom right of the figure. Finally, all of these
z-values are summed to create the first entry in the feature map at the bottom right.

Proceeding to Figure 10.4, the image arrays are now shown with the filter in its next
position, one pixel to the right. Exactly as in Figure 10.3, the z-value is calculated following
Equations 10.1 and 10.2. This z-value can then fill the second position in the
activation map, as shown again at the bottom right of the figure.

Figure 10.4 A continuation of the convolutional example from Figure 10.3, now
showing the activation for the next filter position


This process is repeated for every possible filter position, and the z-value that was 
calculated for each of these nine positions is shown in the bottom-right corner of 
Figure 10.5. To convert this 3x3 map of z-values into a corresponding 3x3 activation 
map, we pass each z-value through an activation function, such as the ReLU function. 
Because a single convolutional layer nearly always has multiple filters, each producing its
own two-dimensional activation map, activation maps have an additional depth dimension,
analogous to the depth provided by the three channels of an RGB image. Each of
these kernel “channels” in the activation map represents a feature that that particular kernel
specializes in recognizing, such as an edge (a straight line) at a particular orientation. 9
Figure 10.6 shows how the calculation of activation values a from the input image builds
up a three-dimensional activation map. The convolutional layer that produced the activation
map shown in Figure 10.6 has 16 kernels, thus resulting in an activation map with a
depth of 16 “channels” (we’ll call these slices going forward). 
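
The way multiple kernels build up a three-dimensional activation map can be sketched with a deliberately naive, pure-Python convolutional layer. Everything here is an illustrative assumption: the function name conv_layer, the random image, and the random weights stand in for real data and learned parameters, and a practical implementation would rely on an optimized library instead of nested loops.

```python
import random

def conv_layer(image, kernels, biases, pad=1):
    """Naive convolutional layer: stride 1, zero padding, ReLU activation.

    image:   height x width x channels nested list
    kernels: list of k x k x channels nested lists, one per filter
    Returns a nested list whose depth equals the number of kernels.
    """
    h, w, ch = len(image), len(image[0]), len(image[0][0])
    k = len(kernels[0])
    # Zero-pad the image on all four sides
    padded = [[[0.0] * ch for _ in range(w + 2 * pad)] for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            padded[r + pad][c + pad] = image[r][c]
    out_h = h + 2 * pad - k + 1
    out_w = w + 2 * pad - k + 1
    activation_map = [[[0.0] * len(kernels) for _ in range(out_w)]
                      for _ in range(out_h)]
    for f, (kernel, bias) in enumerate(zip(kernels, biases)):
        for r in range(out_h):
            for c in range(out_w):
                z = bias
                # Weighted sum across the window and across all channels
                for i in range(k):
                    for j in range(k):
                        for d in range(ch):
                            z += kernel[i][j][d] * padded[r + i][c + j][d]
                activation_map[r][c][f] = max(0.0, z)  # ReLU
    return activation_map

random.seed(42)
# Stand-in for a 32x32 RGB image and 16 kernels of size 3x3x3
image = [[[random.random() for _ in range(3)] for _ in range(32)]
         for _ in range(32)]
kernels = [[[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
            for _ in range(3)] for _ in range(16)]
biases = [0.0] * 16

amap = conv_layer(image, kernels, biases)
print(len(amap), len(amap[0]), len(amap[0][0]))  # 32 32 16
```

Each kernel contributes exactly one value per position regardless of how many input channels there are, so the depth of the output is set by the number of kernels alone.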

In Figure 10.6, the kernel filter is positioned over the top-left corner of the input
image. This corresponds to 16 activation values in the top-left corner of the activation
map: one activation a for each of the 16 kernels. By convolving over all of the pixel
windows in the input image from left to right and from top to bottom, all of the values
in the activation map are filled in. 10 If the first of the 16 filters is tuned to respond optimally
to vertical lines, then the first slice of the activation map will highlight all the
physical regions of the input image that contain vertical lines. If the second filter is tuned
to respond optimally to horizontal lines, then the second slice of the activation map will
highlight regions of the image that contain horizontal lines. In this way, all 16 filters
in the activation map can together represent the spatial location of 16 different spatial
features. 11


9. Figure 1.17 shows real-world examples of the features individual kernels become specialized to detect across a
range of convolutional-layer depths. In the first convolutional layer, for example, the majority of the kernels have
a speciality in detecting an edge at a particular orientation.

Figure 10.5 Finally, the activation for the last filter position has been calculated, and
the activation map is complete.

Figure 10.6 A graphical representation of the input array (left; represented here is a
three-channel RGB image of size 32x32 with the kernel patch currently focused on the
first—i.e., top-left—position) and the activation map (right). There are 16 kernels,
resulting in an activation map with a depth of 16. Each position a given kernel occupies
as it convolves over the input image corresponds to one a value in the resulting
activation map.


At this point, students of deep learning often wonder where the weights for a given 
convolutional kernel come from. In our examples in this section, all of the parameter
values have been contrived. In real-world convolutional layers, however, the kernel
weights and biases are initialized with random values (as usual, per Chapter 9) and
then learned through backpropagation, akin to the way weights and biases are learned
in dense layers. As suggested by the hierarchical abstraction theme of Chapter 1, the
earliest convolutional layers in a deep CNN tend to become tuned to simple features
like straight lines at particular orientations, whereas deeper layers might specialize in
representing, say, a face, a clock, or a dog. A four-minute video by Jason Yosinski and
his colleagues (available at bit.ly/DeepViz) vividly demonstrates the specializations
of convolutional kernels by ConvNet layer depth. 12 We highly recommend checking 
it out. 


Now that we’ve described the general principles underscoring convolutional layers in
deep learning, it’s a good time to review the basic features:

■ They allow deep learning models to learn to recognize features in a position-invariant
manner; a single kernel can identify its cognate feature anywhere in the input
data.

■ They remain faithful to the two-dimensional structure of images, allowing features
to be identified within their spatial context.

■ They significantly reduce the number of parameters required for modeling image
data, yielding higher computational efficiency.

■ Ultimately, they perform machine vision tasks (e.g., image classification) more 
accurately. 
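
To put a rough number on the parameter savings, the sketch below compares a convolutional layer with a dense layer for the 200x200 RGB image discussed earlier in the chapter. Pairing 64 filters with 64 dense neurons is purely our illustrative assumption (the two layers are not equivalent in what they can represent), and the function names are hypothetical.

```python
def conv_layer_params(kernel_size, in_channels, n_filters):
    """Each filter has kernel_size^2 * in_channels weights plus one bias."""
    return (kernel_size * kernel_size * in_channels + 1) * n_filters

def dense_layer_params(n_inputs, n_neurons):
    """Each neuron has one weight per input plus one bias."""
    return (n_inputs + 1) * n_neurons

conv = conv_layer_params(kernel_size=3, in_channels=3, n_filters=64)
dense = dense_layer_params(n_inputs=200 * 200 * 3, n_neurons=64)

print(conv)           # 1792
print(dense)          # 7680064
print(dense // conv)  # 4285
```

Because the kernel weights are shared across every position, the convolutional layer’s parameter count does not depend on the image size at all.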


10. Note that regardless of whether an input image is monochromatic (with only one color channel) or full-color
(with three), there is only one activation map output for each convolutional kernel. If there is one color channel, we
calculate the weighted sum of inputs for that single channel as in Equation 10.1. If there are three color channels,
we calculate the total weighted sum of inputs across all three channels as in Figures 10.3, 10.4, and 10.5. Either
way (after adding the kernel’s bias and passing the resulting z-value through an activation function), we produce
only one activation value for each position that each kernel convolves over.

11. If you are interested in an interactive demonstration of convolutional-filter calculations, we highly recommend
one created by Andrej Karpathy (see Figure 14.6 for a portrait). It’s available at bit.ly/CNNdemo under the
Convolution Demo heading.

12. Yosinski, J., et al. (2015). Understanding neural networks through deep visualization. Proceedings of the International
Conference on Machine Learning.

Convolutional Filter Hyperparameters

In contrast with dense layers, convolutional layers are inherently not fully connected. 

That is, there isn’t a weight mapping every single pixel to every single neuron in the first 
hidden layer. Instead, there are a handful of hyperparameters that dictate the number of 
weights and biases associated with a given convolutional layer. These include: 

■ Kernel size 

■ Stride length 

■ Padding 

Kernel Size 

In all of the examples covered so far in this chapter, the kernel size (also known as filter size
or receptive field 13) has been 3 pixels wide and 3 pixels tall. This is a common size that is
found to be effective across a broad range of machine vision applications in contemporary
ConvNet architectures. A kernel size of 5x5 pixels is also popular, and 7x7 is about
as expansive as they ever get. If the kernel is too large with respect to the image, there 
would be too many competing features in the receptive field and it would be challenging 
for the convolutional layer to learn effectively, but if the receptive field is too small (e.g., 
2x2) it wouldn’t be able to tune to any structures, and that isn’t helpful either. 

Stride Length 

Stride refers to the size of the step that the kernel takes as it moves over the image. Across 
our convolutional-layer example (Figures 10.3 to 10.5) we use a stride length of 1 pixel, 
which is a frequently used option. Another common choice is a 2-pixel stride and, less 
often, a stride of 3. Anything much larger is likely to be suboptimal, because the kernel
might skip regions of the image that are of value to the model. On the other hand,
increasing the stride will yield an increase in speed because there are fewer calculations 
that need to be carried out. As ever in deep learning, it’s about finding a balance—that 
Goldilocks sweet spot—between these effects. We recommend a stride of 1 or 2, while 
avoiding anything larger than 3. 

Padding 

Next is padding, which plays handily with stride to keep the calculations of a convolutional
layer in order. Let’s suppose you had a 28x28 MNIST digit and a 5x5 kernel.
With a stride of 1, there are 24x24 “positions” for the kernel to move through before
it bumps up against the edges of the image, so the activation map output by the layer is
slightly smaller than the input. If you’d like to produce an activation map that is the exact 
same size as the input image, you can simply pad the image with zeros around the edges 
(Figures 10.3, 10.4, and 10.5 contain an example of a zero-padded image). In the case of 
the 28x28 image and the 5x5 kernel, padding with two zeros on each edge will 
produce a 28x28 activation map. This can be calculated with the following equation: 


$$
\text{Activation map} = \frac{D - F + 2P}{S} + 1
\tag{10.3}
$$

Where:

■ D is the size of the image (either width or height, depending on whether you’re
calculating the width or height of the activation map).

■ F is the size of the filter.

■ P is the amount of padding.

■ S is the stride length.

Thus, with our padding of 2, we can calculate that the output volume is 28x28:

$$
\begin{aligned}
\text{Activation map} &= \frac{D - F + 2P}{S} + 1 \\
&= \frac{28 - 5 + 2 \times 2}{1} + 1 \\
&= 28
\end{aligned}
$$


13. The term receptive field is borrowed directly from the study of biological visual systems like the eye.


Given the interconnected nature of kernel size, stride, and padding, one has to make
sure these hyperparameters align when designing CNN architectures. That is, the hyperparameters
must combine to produce a valid activation map size—specifically, an integer
value. Take, for example, a kernel size of 5x5 with a stride of 2 and no padding. Using
Equation 10.3, this would result in a 12.5x12.5 activation map:

$$
\begin{aligned}
\text{Activation map} &= \frac{D - F + 2P}{S} + 1 \\
&= \frac{28 - 5 + 2 \times 0}{2} + 1 \\
&= 12.5
\end{aligned}
$$



There is no such thing as a partial activation value, so a convolutional layer with these 
dimensions would simply not be computable. 
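
Equation 10.3 is easy to wrap in a small helper for checking whether a combination of hyperparameters is valid before building a model. This is a sketch under our own assumptions: the function name activation_map_size is hypothetical, and raising an error for non-integer sizes simply mirrors the reasoning in the text.

```python
from fractions import Fraction

def activation_map_size(d, f, p, s):
    """Equation 10.3: (D - F + 2P) / S + 1, rejecting non-integer sizes."""
    size = Fraction(d - f + 2 * p, s) + 1
    if size.denominator != 1:
        raise ValueError(
            "hyperparameters give a non-integer size: %s" % float(size))
    return int(size)

print(activation_map_size(28, 5, 0, 1))  # 24
print(activation_map_size(28, 5, 2, 1))  # 28

try:
    activation_map_size(28, 5, 0, 2)
except ValueError as err:
    print(err)  # hyperparameters give a non-integer size: 12.5
```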


Pooling Layers 

Convolutional layers frequently work in tandem with another layer type that is a staple in 
machine vision neural networks: pooling layers. This layer type serves to reduce the overall 
count of parameters in a network as well as to reduce complexity, thereby speeding up 
computation and helping to avoid overfitting. 

As discussed in the preceding section, a convolutional layer can have any number
of kernels. Each of these kernels produces an activation map (whose dimensions are
defined by Equation 10.3), such that the output from a convolutional layer is a three-dimensional
array of activation maps, with the depth dimension of the output corresponding
to the number of filters in that convolutional layer. The pooling layer reduces
these activation maps spatially, while leaving the depth of the activation maps intact. 

Like convolutional layers, any given pooling layer has a filter size and a stride length. 
Also like a convolutional layer, the pooling layer slides over its input. At each position it 
occupies, the pooling layer applies a data-reducing operation. Pooling layers most often
use the max operation, and these are termed max-pooling layers: They retain the largest
value (the maximum activation) within the receptive field while discarding the other values
(see Figure 10.7). 14 Typically, a pooling layer has a filter size of 2x2 and a stride length
of 2. 15 In this case, at each position the pooling layer evaluates four activations, retaining
only the maximum value, and thereby downsampling the activations by a factor of 4.
Because this pooling operation happens independently for each depth slice in the three-dimensional
array, a 28x28 activation map with a depth of 16 slices would be reduced to
a 14x14 activation map but it would retain its full complement of 16 slices.
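
A minimal sketch of max-pooling over a single depth slice follows. The function name max_pool and the 4x4 values are our own illustrative choices; they are not the values shown in Figure 10.7.

```python
def max_pool(activation_map, size=2, stride=2):
    """Max-pool one two-dimensional slice of an activation map."""
    out = []
    for r in range(0, len(activation_map) - size + 1, stride):
        row = []
        for c in range(0, len(activation_map[0]) - size + 1, stride):
            window = [activation_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        out.append(row)
    return out

# Hypothetical 4x4 slice; these values are ours, not those of Figure 10.7
slice_4x4 = [[5, 1, 0, 2],
             [3, 2, 4, 1],
             [0, 7, 1, 1],
             [2, 3, 6, 8]]

print(max_pool(slice_4x4))  # [[5, 4], [7, 8]]
```

Applying the same function to each of the 16 slices in turn would shrink every slice spatially while leaving the depth unchanged.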


An alternative approach to pooling for reducing computational complexity is to use 
a convolutional layer with a larger stride (see how stride relates to the output size in 
Equation 10.3). This can be handy for some specialized machine vision tasks (e.g., the 
generative adversarial networks you’ll build later in Chapter 12) that tend to perform 
better without pooling layers. Finally, you might be wondering what happens in a 
pooling layer during backpropagation: The network keeps track of the index of the 
max value in each forward pass, such that the gradient for that particular weight is 
backpropagated correctly and is used to update the correct parameters. 
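
The index bookkeeping described above can be sketched for a single pooling window. The function names and the gradient value of 0.7 are hypothetical, and real libraries handle this internally; the point is only that the gradient flows to the position that supplied the maximum.

```python
def max_pool_forward(window):
    """Return the max of a flattened pooling window and the index it came from."""
    index = max(range(len(window)), key=lambda i: window[i])
    return window[index], index

def max_pool_backward(upstream_gradient, index, window_length):
    """Route the upstream gradient to the position that held the max."""
    gradient = [0.0] * window_length
    gradient[index] = upstream_gradient
    return gradient

window = [3.0, 5.0, 2.0, 1.0]  # a flattened 2x2 window of activations
value, index = max_pool_forward(window)

print(value, index)                                # 5.0 1
print(max_pool_backward(0.7, index, len(window)))  # [0.0, 0.7, 0.0, 0.0]
```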


Figure 10.7 An example of a max-pooling layer being passed a 4x4 activation map.
Like a convolutional layer, the pooling layer slides from left to right and from top to
bottom over the matrix of values input into it. With a 2x2-sized filter, the layer retains
only the largest of four input values (e.g., the orange “5” in the 2x2 hatch-marked
top-left corner). With a 2x2 stride, the resulting output from this max-pooling layer has
one-quarter of the volume of its input: a 2x2 activation map.

14. Other pooling variants (e.g., average pooling, L2-norm pooling) exist but are much less common relative to max-pooling,
which typically suits machine vision applications sufficiently accurately while requiring minimal computational
resources (it is, for example, more computationally expensive to calculate an average than a maximum).

15. Max-pooling with a filter size of 2x2 with a stride of 2 is our default recommendation. Both, however, are
hyperparameters that you can experiment with, if desired.
LeNet-5 in Keras


All the way back at Figure 1.11, as we introduced the hierarchical nature of deep learning,
we discussed the machine vision architecture called LeNet-5. In this section, we use
Keras to construct an MNIST digit-classifying model that is inspired by this landmark
architecture. However, we afford Yann LeCun and his colleagues’ 1998 model some 
modern twists: 

■ Because computation is much cheaper today, we opt to use more kernels in our
convolutional layers. More specifically, we include 32 and 64 filters in the first and
second convolutional layers, respectively, whereas the original LeNet-5 had only 6
and 16 in each.

■ Also thanks to cheap compute, we are subsampling activations only once (with a 
max-pooling layer), whereas LeNet-5 did twice. 16 

■ We leverage innovations like ReLU activations and dropout, which had not yet 
been invented at the time of LeNet-5. 

If you’d like to follow along interactively, please make your way to our LeNet in Keras 
Jupyter notebook. As shown in Example 10.1, relative to our previous notebook (Deep 
Net in Keras, covered in Chapter 9), we have three additional dependencies. 

Example 10.1 Dependencies for LeNet in Keras 

import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.layers import Conv2D, MaxPooling2D # new!
from keras.layers import Flatten # new!

Two of these dependencies, Conv2D and MaxPooling2D, are for implementing convolutional
and max-pooling layers, respectively. The Flatten layer, meanwhile, enables us
to collapse many-dimensional arrays down to one dimension. We’ll explain why that’s
necessary shortly when we build our model architecture.

Next, we load our MNIST data in precisely the same way we did for all of the previous
notebooks involving handwritten digit classification (see Example 5.2). Previously,
however, we reshaped the image data from its native two-dimensional representation to a 
one-dimensional array so that we could feed it into a dense network (see Example 5.3). 
The first hidden layer in our LeNet-5-inspired network will be convolutional, so we can 
leave the images in the 28x28-pixel format, as in Example 10.2. 17 


16. There is a general trend in deep learning to use pooling layers less frequently, presumably due to increasingly 
inexpensive computation costs. 

17. For any arrays passed into a Keras Conv2D() layer, a fourth dimension is expected. Given the monochromatic 
nature of the MNIST digits, we use 1 as the fourth-dimension argument passed into reshape(). If our data were 
full-color images we would have three color channels, and so this argument would be 3. 
Example 10.2 Retaining two-dimensional image shape

X_train = X_train.reshape(60000, 28, 28, 1).astype('float32')
X_valid = X_valid.reshape(10000, 28, 28, 1).astype('float32')

We continue to use the astype() method to convert the digits from integers to floats so
that they can then be scaled to range from 0 to 1 (as in Example 5.4). Also as before, we
convert our integer y labels to one-hot encodings (as in Example 5.5).
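
As a reminder of what these two preprocessing steps do, here is a small pure-Python sketch. It is not the code from Examples 5.4 and 5.5; the helper names scale_pixels and one_hot, and the tiny 2x2 "image", are hypothetical stand-ins used only to show the arithmetic.

```python
def scale_pixels(image):
    """Scale 8-bit pixel intensities (0 to 255) to floats between 0 and 1."""
    return [[pixel / 255.0 for pixel in row] for row in image]


def one_hot(label, n_classes=10):
    """Encode an integer class label as a one-hot list of length n_classes."""
    encoding = [0.0] * n_classes
    encoding[label] = 1.0
    return encoding


tiny_image = [[0, 128], [255, 64]]   # made-up pixel values
scaled = scale_pixels(tiny_image)
encoded = one_hot(3)                 # the digit "3"
print(scaled)
print(encoded)
```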

The data loading and preprocessing behind us, we configure our LeNet-ish model
architecture as in Example 10.3.

Example 10.3 CNN model inspired by LeNet-5 

model = Sequential()

# first convolutional layer:
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu',
                 input_shape=(28, 28, 1)))

# second conv layer, with pooling and dropout:
model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())

# dense hidden layer, with dropout:
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))

# output layer:
model.add(Dense(n_classes, activation='softmax'))

All of the previous MNIST classifiers in this book have been dense networks, consisting
only of Dense layers of neurons. Here we use convolutional layers (Conv2D) as our first
two hidden layers. 18 The settings we select for these convolutional layers are:

■ The integers 32 and 64 correspond to the number of filters we’re specifying for the
first and second convolutional layer, respectively.

■ kernel_size is set to 3x3 pixels.

■ We’re using relu as our activation function.


18. Conv2D() is our choice here because we’re convolving over two-dimensional arrays, that is, images. In
Chapter 11, we’ll use Conv1D() to convolve over one-dimensional data (strings of text). Conv3D() layers also 
exist but are outside the scope of this book: These are for carrying out the convolutional operation over all three 
dimensions, as one might want to for three-dimensional medical images. 
■ We’re using the default stride length, which is 1 pixel (along both the vertical and
the horizontal axes). Alternative stride lengths can be specified by providing a
strides argument to Conv2D.

■ We’re using the default padding, which is 'valid'. This means that we will forgo
the use of padding: Per Equation 10.3, with a stride of 1, our activation map will
be 2 pixels shorter and 2 pixels narrower than the input to the layer (e.g., a 28x28-pixel
input image shrinks to a 26x26 activation map). The alternative would be to
specify the argument padding='same', which would pad the input with zeros so
that the output retains the same size as the input (a 28x28-pixel input image results
in a 28x28 activation map).
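
The effect of the padding setting on activation-map size can be checked with a few lines of Python. This sketch assumes Equation 10.3 has the usual form, (D - F + 2P)/S + 1, where D is the input size, F the filter size, P the padding, and S the stride; the function name activation_map_size is our own and is not part of Keras.

```python
def activation_map_size(d, f, p=0, s=1):
    """Output height/width for input size d, filter size f, padding p, stride s.

    Assumes the standard relation (d - f + 2p) / s + 1, rounded down.
    """
    return (d - f + 2 * p) // s + 1


# 'valid' padding: no zeros are added around the 28x28 input.
valid_size = activation_map_size(28, 3, p=0, s=1)

# 'same' padding with a 3x3 filter and stride 1 amounts to one pixel of zeros per side.
same_size = activation_map_size(28, 3, p=1, s=1)

print(valid_size, same_size)
```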

To our second hidden layer of neurons, we add a number of additional layers of
computational operations: 19

■ MaxPooling2D() is used to reduce computational complexity. As in our example in
Figure 10.7, with pool_size set to 2x2 and the strides argument left at its
default (None, which sets stride length equal to pool size), we are reducing the
volume of our activation map by three-quarters.

■ As per Chapter 9, Dropout() reduces the risk of overfitting to our training data.

■ Finally, Flatten() converts the three-dimensional activation map (produced by
Conv2D() and then downsampled by pooling) to a one-dimensional array. This enables
us to feed the activations as inputs into a Dense layer, which can only accept
one-dimensional arrays.

As already discussed in this chapter, the convolutional layers in the network learn to 
represent spatial features within the image data. The first convolutional layer learns to 
represent simple features like straight lines at a particular orientation, whereas the second 
convolutional layer recombines those simple features into more-abstract representations. 
The intuition behind having a Dense layer as the third hidden layer in the network is that 
it allows the spatial features identified by the second convolutional layer to be recombined 
in any way that’s optimal for distinguishing classes of images (there is no sense of spatial
orientation within a dense layer). Put differently, the two convolutional layers learn to 
identify and label spatial features in the images, and these spatial features are then fed into 
a dense layer that maps these spatial features to a particular class of images (e.g., the digit 
“3” as opposed to the digit “8”). In this way, the convolutional layers can be thought of 
as feature extractors. The dense layer of the network receives the extracted features as its 
input, instead of raw pixels. 

We apply Dropout() to the dense layer (again to avoid overfitting), and the network
then culminates in a softmax output layer, identical to the output layers we have used
in all of our previous MNIST-classifying notebooks. Finally, a call to model.summary()
prints out a summary of our CNN architecture, as shown in Figure 10.8.


19. Layer types such as pooling, dropout, and flattening layers aren’t made up of artificial neurons, so they don’t
count as stand-alone hidden layers of a deep learning network like dense or convolutional layers do. They
nevertheless perform valuable operations on the data flowing through our neural network, and we can use the Keras
add() method to include them in our model architecture in the same way that we add layers of neurons.
Layer (type)                    Output Shape          Param #
conv2d_1 (Conv2D)               (None, 26, 26, 32)    320
conv2d_2 (Conv2D)               (None, 24, 24, 64)    18496
max_pooling2d_1 (MaxPooling2    (None, 12, 12, 64)    0
dropout_1 (Dropout)             (None, 12, 12, 64)    0
flatten_1 (Flatten)             (None, 9216)          0
dense_1 (Dense)                 (None, 128)           1179776
dropout_2 (Dropout)             (None, 128)           0
dense_2 (Dense)                 (None, 10)            1290

Total params: 1,199,882
Trainable params: 1,199,882
Non-trainable params: 0


Figure 10.8 A summary of our LeNet-5-inspired ConvNet architecture. Note that the 
None dimension in each layer is a placeholder for the number of images per batch (i.e., 
the stochastic gradient descent mini-batch size). Because batch size is specified later (in 
the model.fit() method), None is used in the interim.


Let’s break down the “Output Shape” column of Figure 10.8 first:

■ The first convolutional layer, conv2d_1, takes in the 28x28-pixel MNIST digits.
With the chosen kernel hyperparameters (filter size, stride, and padding), the layer
outputs a 26x26-pixel activation map (as per Equation 10.3). 20 With 32 kernels,
the resulting activation map has a depth of 32 slices.

■ The second convolutional layer receives as its input the 26x26x32 activation 
map from the first convolutional layer. The kernel hyperparameters are unchanged, 
so the activation map shrinks again, now down to 24x24. The map is, however,
twice as deep because there are 64 kernels in the layer. 

■ As discussed earlier, a max-pooling layer with a kernel size of 2 and a stride of 2
reduces the volume of data flowing through the network by half in each of the spatial
dimensions, yielding an activation map of 12x12. The depth of the activation
map is not affected by pooling, so it retains 64 slices.

■ The flatten layer collapses the three-dimensional activation map down to a one- 
dimensional array with 9,216 elements. 21 


20. Activation map $= \frac{D - F + 2P}{S} + 1 = \frac{28 - 3 + 2 \times 0}{1} + 1 = 26$

21. 12x12x64 = 9,216
■ The dense hidden layer contains 128 neurons, so its output is a one-dimensional
array of 128 activation values.

■ Likewise, the softmax output layer consists of 10 neurons, so it outputs 10
probabilities, one ŷ for each possible MNIST digit.
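
The walk through the “Output Shape” column can be reproduced with a short pure-Python sketch. It assumes the standard output-size relation for 'valid' convolutions and for pooling; the helper names conv_output and pool_output are hypothetical, and the layer names simply mirror those in Figure 10.8.

```python
def conv_output(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution (assumed form of Equation 10.3)."""
    return (size - kernel + 2 * padding) // stride + 1


def pool_output(size, pool, stride=None):
    """Spatial size after pooling; the stride defaults to the pool size."""
    stride = pool if stride is None else stride
    return (size - pool) // stride + 1


size, depth = 28, 1                            # MNIST input: 28x28x1
shapes = {}

size, depth = conv_output(size, 3), 32         # first Conv2D, 32 filters
shapes['conv2d_1'] = (size, size, depth)

size, depth = conv_output(size, 3), 64         # second Conv2D, 64 filters
shapes['conv2d_2'] = (size, size, depth)

size = pool_output(size, 2)                    # 2x2 max-pooling, stride 2
shapes['max_pooling2d_1'] = (size, size, depth)

shapes['flatten_1'] = (size * size * depth,)   # collapse to one dimension
shapes['dense_1'] = (128,)
shapes['dense_2'] = (10,)

for name, shape in shapes.items():
    print(name, shape)
```

Dropout is omitted from the trace because it leaves the shape of its input unchanged.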

Now let’s move on to dissecting the “Param #” column of Figure 10.8:

■ The first convolutional layer has 320 parameters: 

■ 288 weights: 32 filters x 9 weights each (from the 3x3 filter size x 1 channel)

■ 32 biases, one for each filter 

■ The second convolutional layer has 18,496 parameters: 

■ 18,432 weights: 64 filters x 9 weights per filter x 32 input channels (one
from each of the 32 filters of the preceding layer)

■ 64 biases, one for each filter 

■ The dense hidden layer has 1,179,776 parameters: 

■ 1,179,648 weights: 9,216 inputs from the preceding layer’s flattened activation
map x 128 neurons in the dense layer 22

■ 128 biases, one for each neuron in the dense layer 

■ The output layer has 1,290 parameters: 

■ 1,280 weights: 128 inputs from the preceding layer x 10 neurons in the
output layer

■ 10 biases, one for each neuron in the output layer 

■ Cumulatively, the entire ConvNet has 1,199,882 parameters, the vast majority (98.3 
percent) of which are associated with the dense hidden layer. 
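
The parameter counts above follow from two simple formulas, which this pure-Python sketch spells out. The helper names conv_params and dense_params are our own (they are not Keras functions); the arithmetic is the same as in the breakdown of Figure 10.8.

```python
def conv_params(n_filters, kernel_h, kernel_w, in_channels):
    """Weights (one kernel per filter per input channel) plus one bias per filter."""
    weights = n_filters * kernel_h * kernel_w * in_channels
    biases = n_filters
    return weights + biases


def dense_params(n_inputs, n_neurons):
    """One weight per input per neuron, plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


counts = {
    'conv2d_1': conv_params(32, 3, 3, 1),
    'conv2d_2': conv_params(64, 3, 3, 32),
    'dense_1': dense_params(12 * 12 * 64, 128),
    'dense_2': dense_params(128, 10),
}
total = sum(counts.values())
dense_share = counts['dense_1'] / total

for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {total:,} ({dense_share:.1%} in the dense hidden layer)")
```

The pooling, dropout, and flatten layers do not appear because they have no trainable parameters.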

To compile the model, we call the model.compile() method as usual. Likewise, the
model.fit() method will begin training. 23 The results of our best epoch are shown in
Figure 10.9. Previously, our best result was attained by Deep Net in Keras: an accuracy
of 97.87 percent on the validation set of MNIST digits. But here, the ConvNet inspired
by LeNet-5 achieved 99.27 percent validation accuracy. This is fairly remarkable because
the CNN wiped away 65.7 percent of the remaining error; 24 presumably these now
correctly classified instances are some of the trickiest digits to classify because they were
not identified correctly by our already solid-performing Deep Net.
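
The 65.7 percent figure is a relative reduction in validation error, which can be verified in a couple of lines. The variable names here are ours; the two accuracies are the ones quoted in the text.

```python
dense_acc = 0.9787   # best validation accuracy of Deep Net in Keras
conv_acc = 0.9927    # best validation accuracy of the LeNet-5-inspired ConvNet

# Fraction of the dense net's remaining error that the ConvNet eliminated.
error_reduction = 1 - (1 - conv_acc) / (1 - dense_acc)
print(f"{error_reduction:.1%}")
```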


Epoch 9/10
60000/60000 [==============================] - 39s 654us/step - loss: 0.0276 - acc: 0.9911 - val_loss: 0.0260 - val_acc: 0.9927


Figure 10.9 Our LeNet-5-inspired ConvNet architecture peaked at a 99.27 percent 
validation accuracy following nine epochs of training, thereby outperforming the accuracy 
of the dense nets we trained earlier in the book. 


22. Notice that the dense layer has two orders of magnitude more parameters than the convolutional layers! 

23. These steps are identical to the previous notebooks, with the minor exception that the number of epochs is 
reduced (to 10), because we found that validation loss stopped decreasing after nine epochs of training. 

24. 1 - (100%-99.27%)/(100%-97.87%) 
Figure 10.10 A general approach to CNN design: A block (shown in red) of
convolutional layers (often one to three of them) and a pooling layer is repeated several
times. This is followed by one (up to a few) dense layers.


AlexNet and VGGNet in Keras 


In our LeNet-inspired architecture (Example 10.3), we included a pair of convolutional
layers followed by a max-pooling layer. This is a routine approach within convolutional
neural networks. As depicted in Figure 10.10, it is common to group convolutional layers
(often one to three of them) together with a pooling layer. These conv-pool blocks can
then be repeated several times. As in LeNet-5, such CNN architectures regularly
culminate in a dense hidden layer (up to several dense hidden layers) and then the output
layer.

The AlexNet model (Figure 1.17), which we introduced as the 2012 computer vision
competition-winning harbinger of the deep learning revolution, is another architecture
that features the convolutional layer block approach provided in Figure 10.10. In
our AlexNet in Keras notebook, we use the code shown in Example 10.4 to emulate this
structure. 25

Example 10.4 CNN model inspired by AlexNet 

# assumes BatchNormalization has also been imported from keras.layers
model = Sequential()

# first conv-pool block:
model.add(Conv2D(96, kernel_size=(11, 11),
                 strides=(4, 4), activation='relu',
                 input_shape=(224, 224, 3)))


25. This AlexNet model architecture is the same one visualized by Jason Yosinski with his DeepViz tool. If you 
didn’t view his video when we mentioned it earlier in this chapter, then we recommend checking it out at 
bit.ly/DeepViz now.
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
model.add(BatchNormalization())

# second conv-pool block:
model.add(Conv2D(256, kernel_size=(5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
model.add(BatchNormalization())

# third conv-pool block:
model.add(Conv2D(256, kernel_size=(3, 3), activation='relu'))
model.add(Conv2D(384, kernel_size=(3, 3), activation='relu'))
model.add(Conv2D(384, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
model.add(BatchNormalization())

# dense layers:
model.add(Flatten())
model.add(Dense(4096, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='tanh'))
model.add(Dropout(0.5))

# output layer:
model.add(Dense(17, activation='softmax'))

The key points about this particular model architecture are: 

■ For this notebook, we moved beyond the MNIST digits to a dataset of larger-sized
(224x224-pixel) images that are full-color (hence the 3 channels of depth in the
input_shape argument passed to the first Conv2D layer).

■ AlexNet used larger filter sizes in the earliest convolutional layers relative to what is
popular today (for example, kernel_size=(11, 11)).

■ Such use of dropout in only the dense layers near the model output (and not in the 
earlier convolutional layers) is common. The intuition behind this is that the early 
convolutional layers enable the model to represent spatial features of images that 
generalize well beyond the training data. However, a very specific recombination 
of these features, as facilitated by the dense layers, may be unique to the training 
dataset and thus may not generalize well to validation data. 


The AlexNet and VGGNet (more about this in a moment) model architectures
are very large (AlexNet, for example, has 21.9 million parameters), and you may
need to increase the memory available to Docker on your machine to load them. See
bit.ly/DockerMem for instructions on how to do this.
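
As a rough check on the size of the AlexNet-inspired model in Example 10.4, the following pure-Python sketch tallies its parameters layer by layer. It assumes 'valid' padding throughout (the Keras default) and assumes that each batch-normalization layer stores four values per channel (scale, shift, running mean, and running variance); the helper names are hypothetical and the sketch does not call Keras.

```python
def conv(size, channels, filters, kernel, stride=1):
    """'valid' convolution: return (new_size, new_channels, n_params)."""
    new_size = (size - kernel) // stride + 1
    n_params = filters * kernel * kernel * channels + filters
    return new_size, filters, n_params


def pool(size, pool_size=3, stride=2):
    """Max-pooling changes only the spatial size and has no parameters."""
    return (size - pool_size) // stride + 1


def batch_norm(channels):
    """Assumed: four stored values per channel."""
    return 4 * channels


def dense(n_inputs, n_neurons):
    """One weight per input per neuron, plus one bias per neuron."""
    return n_inputs * n_neurons + n_neurons


size, channels, total = 224, 3, 0

# Each conv-pool block is a list of (filters, kernel, stride) convolutions,
# followed by 3x3 max-pooling with a stride of 2 and batch normalization.
blocks = [
    [(96, 11, 4)],
    [(256, 5, 1)],
    [(256, 3, 1), (384, 3, 1), (384, 3, 1)],
]
for block in blocks:
    for filters, kernel, stride in block:
        size, channels, n_params = conv(size, channels, filters, kernel, stride)
        total += n_params
    size = pool(size)
    total += batch_norm(channels)

n_flat = size * size * channels            # length of the flattened array
for n_in, n_out in [(n_flat, 4096), (4096, 4096), (4096, 17)]:
    total += dense(n_in, n_out)

print(f"{total:,} parameters")
```

Under these assumptions the tally comes to roughly 21.9 million, with the two 4,096-neuron dense layers accounting for most of it.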
Following AlexNet being crowned the 2012 winner of the ImageNet Large Scale
Visual Recognition Challenge, deep learning models suddenly began to be used widely
in the competition (see Figure 1.15). Among these models, there has been a general trend
toward making the neural networks deeper and deeper. For example, in 2014 the
runner-up in the ILSVRC was VGGNet, 26 which follows the same repeated conv-pool-block
structure as AlexNet; VGGNet simply has more of them, and with smaller (all 3x3-pixel)
kernel sizes. We provide the architecture shown in Example 10.5 in our VGGNet in Keras
notebook.

Example 10.5 CNN model inspired by VGGNet 

model = Sequential()

model.add(Conv2D(64, 3, activation='relu',
                 input_shape=(224, 224, 3)))
model.add(Conv2D(64, 3, activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(BatchNormalization())

model.add(Conv2D(128, 3, activation='relu'))
model.add(Conv2D(128, 3, activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(BatchNormalization())

model.add(Conv2D(256, 3, activation='relu'))
model.add(Conv2D(256, 3, activation='relu'))
model.add(Conv2D(256, 3, activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(BatchNormalization())

model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(BatchNormalization())

model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(Conv2D(512, 3, activation='relu'))
model.add(MaxPooling2D(2, 2))
model.add(BatchNormalization())


26. Developed by the Visual Geometry Group at the University of Oxford: Simonyan, K., and Zisserman, A.
(2015). Very deep convolutional networks for large-scale image recognition. arXiv: 1409.1556.
model.add(Flatten())

model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))

model.add(Dense(17, activation='softmax'))


Residual Networks 


As our example ConvNets from this chapter (LeNet-5, AlexNet, VGGNet) suggest,
there’s a trend over time toward deeper networks. In this section, we recapitulate the
topic of vanishing gradients, the (often dramatic) slowing in learning that can occur as
network architectures are deepened. We then describe an imaginative solution that has
emerged in recent years: residual networks.

Vanishing Gradients: The Bête Noire of Deep CNNs

With more layers, models are able to learn a larger variety of relatively low-level features 
in the early layers, and increasingly complex abstractions are made possible in the later 
layers via nonlinear recombination. This approach, however, has limits: If we continue
to simply make our networks deeper (e.g., by adding more and more of the conv-pool 
blocks from Figure 10.10), they will eventually be debilitated by the vanishing gradient 
problem. 

We introduced vanishing gradients in Chapter 9; the basis of the issue is that parameters
in early layers of the network are far away from the cost function: the source of the
gradient that is propagated backward through the network. As the error is backpropagated,
a larger and larger number of parameters contribute to the error, and thus each
layer closer to the input gets a smaller and smaller update. The net effect is that early
layers in increasingly deep networks become more difficult to train (see Figure 8.8).
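
One common way to build intuition for how quickly the update signal can fade with depth is to multiply together the per-layer factors that the chain rule introduces. The sketch below is a deliberately simplified illustration, not a model of any network in this chapter: it assumes sigmoid activations evaluated at zero and a weight of 1 in every layer, both assumptions made here purely for illustration, and the function names are hypothetical.

```python
import math


def sigmoid_derivative(z):
    """Derivative of the sigmoid function, which peaks at 0.25 when z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def gradient_scale(n_layers, z=0.0, weight=1.0):
    """Factor by which a gradient shrinks after passing back through n_layers.

    Assumes every layer contributes the same factor: weight * sigmoid'(z).
    """
    return (weight * sigmoid_derivative(z)) ** n_layers


scales = {n: gradient_scale(n) for n in (1, 5, 10, 20)}
for n, scale in scales.items():
    print(f"{n:2d} layers: gradient scaled by {scale:.2e}")
```

Even in this best case for the sigmoid derivative, the signal reaching a layer ten steps from the output is under one-millionth of its original size.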

Because of the vanishing gradient problem, it is commonly observed that as one 
increases the depth of a network, accuracy increases up to a saturation point and then 
later begins to degrade as networks become excessively deep. Imagine a shallow network
that is performing well. Now let’s copy those layers with their weights, and stack
on new layers atop to make the model deeper. Intuition might say that the new, deeper
model would take the existing gains from the early pretrained layers and improve. If the
new layers performed simple identity mapping (wherein they faithfully reproduced the
exact results of the earlier layers), then we’d see no increase in training error. It turns out,
however, that plain deep networks struggle to learn identity functions. 27,28 Thus, these 
new layers either add new information and decrease the error, or they do not add new 
information (but also fail at identity mapping) and the error increases. Given that adding 


27. Hardt, M., and Ma, T. (2018). Identity matters in deep learning. arXiv: 1611.04231. 

28. Hold tight! More clarification is coming up on the terms identity mapping and identity functions shortly.
useful information is an exceedingly rare outcome (relative to the baseline, which is
essentially random noise), it transpires that beyond a certain point these extra layers will,
probabilistically, contribute to an overall degradation in performance.

Residual Connections 

Residual networks (or ResNets, for short) rely on the idea of residual connections,
which exist within so-called residual modules. A residual module (as illustrated in
Figure 10.11) is a collective term for a sequence of convolutions, batch-normalization
operations, and ReLU activations that culminates with a residual connection. For the
sake of simplicity here, we consider these various layers within a residual module to be
a single, discrete unit. Following on with the most straightforward definition, a residual
connection exists when the input to one such residual module is summed with its output to
produce the final activation for that residual module. In other words, a residual module
will receive some input $a_{i-1}$, 29 which is transformed by the convolutions and activation
functions within the residual module to generate its output $a_i$. Subsequently, this output
and the original input to the residual module are summed: $y_i = a_i + a_{i-1}$. 30

Following the structure and the basic math of the residual connection from the preceding
paragraph, you’ll notice that an interesting feature emerges: If the residual module has
an activation $a_i = 0$ (that is, it has learned nothing), the final output of the residual
module will simply be the original input, since the two are summed. Following on with
the equation we used most recently:


$$
\begin{aligned}
y_i &= a_i + a_{i-1} \\
    &= 0 + a_{i-1} \\
    &= a_{i-1}
\end{aligned}
$$

In this case, the residual module is effectively an identity function. These residual modules 
either learn something useful and contribute to reducing the error of the network, or 
they perform identity mapping and do nothing at all. Because of this identity-mapping
behavior, residual connections are also called “skip connections,” because they enable 
information to skip the functions located within the residual module. 
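
The identity-mapping behavior is easy to see in a toy example. In this pure-Python sketch, a "residual module" is reduced to a single function applied to a list of activations; the names residual_module, learned_nothing, and learned_something are hypothetical, and real residual modules contain convolutions, batch normalization, and ReLU activations rather than the stand-in transformations used here.

```python
def residual_module(a_prev, transform):
    """Sum the module's transformed output a_i with its original input a_{i-1}."""
    a_i = transform(a_prev)
    return [out + inp for out, inp in zip(a_i, a_prev)]


def learned_nothing(a_prev):
    """A module whose internal layers output zeros (a_i = 0)."""
    return [0.0 for _ in a_prev]


def learned_something(a_prev):
    """A stand-in for a module that has learned a useful transformation."""
    return [0.5 * value for value in a_prev]


a_prev = [1.0, -2.0, 3.0]
y_identity = residual_module(a_prev, learned_nothing)
y_changed = residual_module(a_prev, learned_something)
print(y_identity)
print(y_changed)
```

When the internal layers contribute nothing, the input passes through untouched, so stacking such a module cannot make the network's output worse than it was without the module.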

In addition to this neutral-or-better characteristic of residual networks, we should also
highlight the value of their inherent multiplicity. Consider the schematic in Figure 10.12:
When several residual modules are stacked, later residual modules receive inputs that are
increasingly complex combinations of the residual modules and skip connections from
earlier in the network. Seen on the right in this figure, a decision tree representation
shows how, at each of the three residual modules in the network, information may either 
pass through the residual block or bypass it via a skip connection. Thus, as is shown at the 
bottom of the figure, with only three residual modules there are eight possible paths the 
information can take. In practice, the process is not commonly as binary as it is depicted 


29. Remember that the input to any given layer is simply the output of the preceding layer, denoted here by $a_{i-1}$.

30. We’ve opted to denote the final output of the whole residual module as $y_i$, but that does not mean to indicate
that this is necessarily the final output of the entire model. It simply serves to avoid confusion with the activations
from the current and preceding layers, indicating that the final output is a distinct entity derived from the sum of
those activations.



Figure 10.11 A schematic representation of a residual module. Batch normalization 
and dropout layers are not shown, but may be included. 
Figure 10.12 Shown at left is the conventional representation of residual blocks within
a residual network. Shown at right is an unraveled view, which demonstrates how, 
depending on which skip connections are used, the final path of information from input 
to output can be varied by the network. 


in this figure. That is, the value of $a_i$ is seldom 0, and therefore the output is usually
some mix of the identity function and the residual module. Given this insight, residual
networks can be thought of as complex combinations or ensembles of many shallower 
networks that are pooled at various depths. 
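
The count of eight paths through three residual modules comes from making a two-way choice (pass through the module or take the skip connection) at each module. A short sketch using the standard library enumerates them; the function name enumerate_paths and the labels 'module' and 'skip' are our own.

```python
from itertools import product


def enumerate_paths(n_modules):
    """List every combination of 'module' or 'skip' across n_modules stacked modules."""
    return list(product(('module', 'skip'), repeat=n_modules))


paths = enumerate_paths(3)
for path in paths:
    print(' -> '.join(path))
print(len(paths), "possible paths")
```

In general the number of such paths doubles with each added module, which is one way to read the "ensemble of shallower networks" interpretation.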
ResNet

The first deep residual network, ResNet, was introduced by Microsoft Research in
2015 31 and won first place in that year’s ILSVRC image-classification competition.
Referring back to Figure 1.15, this makes ResNet the leader of the pack of deep learning 
algorithms that surpassed human performance at image recognition in 2015. 

Up to this point in the book, we’ve made it sound as if image classification is the only
contest at ILSVRC, but in fact ILSVRC has several machine vision competition categories,
such as object detection and image segmentation (more on these two machine vision
tasks coming soon in this chapter). In 2015, ResNet took first place not only in the
ILSVRC image-classification competition but in the object detection and image segmentation
categories, too. Further, in the same year, ResNet was also recognized as champion
of the detection and segmentation competitions involving COCO, an alternative image
dataset to the ILSVRC set. 32

Given the broad sweep of machine vision trophies upon the invention of residual 
networks, it’s clear they were a transformative innovation. They managed to squeeze
out more juice relative to the existing networks by enabling much deeper architectures 
without the decrease in performance associated with those extra layers if they fail to learn 
useful information about the problem. 

In this book, we strive to make our code examples accessible to our readers by having 
model architectures and datasets that are small enough to carry out training on even a 
modest laptop computer. Residual network architectures, as well as the datasets that make 
them worthwhile, do not fall into this category. That said, using a powerful, general 
approach called transfer learning —which we will introduce at the end of this chapter—we 
provide you with resources to nevertheless take advantage of very deep architectures like 
ResNet with the model’s parameters pretrained on massive datasets.

Applications of Machine Vision 

In this chapter, you’ve learned about layer types that enable machine vision models to
perform well. We’ve also discussed some of the approaches that are used to improve 
these models, and we’ve delved into some of the canonical machine vision algorithms 
of the past few years. Up to here in the chapter, we’ve dealt with the problem of image 
classification—that is, identifying the main subject in an image, as seen at the left in 
Figure 10.13. Now, to wrap up the chapter, we turn our focus to other interesting applications
of machine vision beyond image classification. The first is object detection, seen
in the second panel from the left in Figure 10.13, wherein the algorithm is tasked with
drawing bounding boxes around objects in an image. Next is image segmentation, shown
in the third and fourth panels of Figure 10.13. Semantic segmentation identifies all objects
of a particular class down to the pixel level, whereas instance segmentation discriminates 
between different instances of a particular class, also at the pixel level. 


31. He, K., et al. (2015). Deep residual learning for image recognition. arXiv: 1512.03385. 

32. cocodataset.org 
[Figure panels, left to right: CLASSIFICATION (“BALLOONS”), DETECTION, SEGMENTATION, SEGMENTATION]



Figure 10.13 These are examples of various machine vision applications. We have
encountered classification previously in this chapter, but now we cover object detection, 
semantic segmentation, and instance segmentation. 


Object Detection 

Imagine a photo of a group of people sitting down to dinner. There are several people 
in the image. There is a roast chicken in the middle of the table, and maybe a bottle of 
wine. If we desired an automated system that could predict what was served for dinner
or identify the people sitting at the table, an image-classification algorithm would not
provide that level of granularity. Enter object detection.

Object detection has broad applications, such as detecting pedestrians in the field
of view for autonomous driving or identifying anomalies in medical images. Generally
speaking, object detection is divided into two tasks: detection (identifying where
the objects in the image are) and then, subsequently, classification (identifying what the
objects are that have been detected). Typically this pipeline has three stages:

1. A region of interest must be identified. 

2. Automatic feature extraction is performed on this region. 

3. The region is classified. 

Seminal models—ones that have defined progress in this area—include R-CNN, Fast 
R-CNN, Faster R-CNN, and YOLO. 

R-CNN 

R-CNN was proposed in 2013 by Ross Girshick and his colleagues at UC Berkeley. 33 
The algorithm was modeled on the attention mechanism of the human brain, wherein an 
entire scene is scanned and focus is placed on specific regions of interest. To emulate this 
attention, Girshick and his coworkers developed R-CNN to: 

1. Perform a selective search for regions of interest (ROIs) within the image. 

2. Extract features from these ROIs by using a CNN. 

3. Combine two “traditional” (as in Figure 1.12) machine learning approaches, called
linear regression and support vector machines, to, respectively, refine the locations of
bounding boxes 34 and classify objects within each of those boxes.


33. Girshick, R., et al. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation.
arXiv: 1311.2524. 

34. See examples of bounding boxes in Figure 10.14. 
R-CNNs redefined the state of the art in object detection, achieving a massive gain
in performance over the previous best model in the Pattern Analysis, Statistical Modeling
and Computational Learning (PASCAL) Visual Object Classes (VOC) competition. 35
This ushered in the era of deep learning in object detection. However, this model had
some limitations:

■ It was inflexible: The input size was fixed to a single specific image shape. 

■ It was slow and computationally expensive: Both training and inference are multistage
processes involving CNNs, linear regression models, and support vector machines. 

Fast R-CNN 

To address the primary drawback of R-CNN—its speed—Girshick went on to develop 
Fast R-CNN. 36 The chief innovation here was the realization that during step 2 of the 
R-CNN algorithm, the CNN was unnecessarily being run multiple times, once for each 
region of interest. With Fast R-CNN, the ROI search (step 1) is run as before, but during 
step 2, the CNN is given a single global look at the image, and the extracted features 
are used for all ROIs simultaneously. A vector of features is extracted from the final layer 
of the CNN, which (for step 3) is then fed into a dense network along with the ROI. 

This dense net learns to focus on only the features that apply to each individual ROI, 
culminating in two outputs per ROI: 

1. A softmax probability output over the classification categories (for a prediction of 
what class the detected object belongs to) 

2. A bounding box regressor (for refinement of the ROI's location) 

Following this approach, the Fast R-CNN model has to perform feature extraction 
using a CNN only once for a given image (thereby reducing computational complexity), 
and then the ROI search and dense layers work together to finish the object-detection 
task. As the name suggests, the reduced computational complexity of Fast R-CNN 
corresponds to speedier compute times. It also represents a single, unified model without the 
multiple independent parts of its predecessor. Nevertheless, as with R-CNN, the initial 
(ROI search) step of Fast R-CNN still presents a significant computational bottleneck. 
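
The difference between step 2 of R-CNN and step 2 of Fast R-CNN can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not either paper's implementation: extract_features is a hypothetical stand-in for a CNN forward pass, the image is a small grid of numbers, and ROIs are assumed to be (top, left, bottom, right) tuples. 

```python
# Toy comparison of how often the feature extractor runs.
# `extract_features` is a hypothetical stand-in for a CNN forward pass.

calls = {"cnn": 0}

def extract_features(grid):
    """Pretend CNN: count the call and return a 'feature map' (doubled values)."""
    calls["cnn"] += 1
    return [[2 * value for value in row] for row in grid]

def crop(grid, roi):
    """Return the part of a 2D grid covered by roi = (top, left, bottom, right)."""
    top, left, bottom, right = roi
    return [row[left:right] for row in grid[top:bottom]]

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4 toy image
rois = [(0, 0, 2, 2), (1, 1, 3, 3), (2, 0, 4, 4)]          # assumed ROI proposals

# R-CNN style: crop each ROI from the image, then run the CNN on every crop.
calls["cnn"] = 0
rcnn_features = [extract_features(crop(image, roi)) for roi in rois]
rcnn_calls = calls["cnn"]

# Fast R-CNN style: run the CNN once, then crop each ROI from the feature map.
calls["cnn"] = 0
feature_map = extract_features(image)
fast_features = [crop(feature_map, roi) for roi in rois]
fast_calls = calls["cnn"]

print(rcnn_calls, fast_calls)
```

Because the stand-in extractor works element by element, both routes yield identical features here; with a real CNN the features would generally differ, and only the saving in forward passes carries over. 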

Faster R-CNN 

The model architectures in this section are clever works of innovation; their names, 
however, are not. Our third object-detection algorithm of note is Faster R-CNN, which 
(you guessed it!) is even swifter than Fast R-CNN. 

Faster R-CNN was revealed in 2015 by Shaoqing Ren and his coworkers at Microsoft 
Research (Figure 10.14 shows example outputs). 37 To overcome the ROI-search 
bottleneck of R-CNN and Fast R-CNN, Ren and his colleagues had the cunning insight 
to leverage the feature activation maps from the model's CNN for this step, too. Those 
activation maps contain a great deal of contextual information about an image. Because 


35. PASCAL VOC ran competitions from 2005 until 2012; the dataset remains available and is considered one of 
the gold standards for object-detection problems. 

36. Girshick, R. (2015). Fast R-CNN. arXiv: 1504.08083. 

37. Ren, S. et al. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. 
arXiv: 1506.01497. 

Figure 10.14 These are examples of object detection (performed on four separate 
images by the Faster R-CNN algorithm). Within each region of interest (defined by the 
bounding boxes within the images), the algorithm predicts what the object within the 
region is. 


each map has two dimensions representing location, they can be thought of as literal maps 
of the locations of features within a given image. If, as in Figure 10.6, a convolutional 
layer has 16 filters, the activation map it outputs has 16 maps, together representing the 
locations of 16 features in the input image. As such, these feature maps contain rich 
detail about what is in an image and where it is. Faster R-CNN takes advantage of this rich 
detail to propose ROI locations, enabling a CNN to seamlessly perform all three steps of 
the object-detection process, thereby providing a unified model architecture that builds 
on R-CNN and Fast R-CNN but is markedly quicker. 

YOLO 

Within each of the various object-detection models described thus far, the CNN focused 
on the individual proposed ROIs as opposed to the whole input image. 38 Joseph 
Redmon and coworkers published on You Only Look Once (YOLO) in 2015, which bucked 
this trend. 39 YOLO begins with a pretrained 40 CNN for feature extraction. Next, the 
image is divided into a series of cells, and, for each cell, a number of bounding boxes and 


38. Technically, the CNN looked at the whole image at the start in both Fast R-CNN and Faster R-CNN. 
However, in both cases this was simply a one-shot step to extract features, and from then onward the image was 
treated as a set of smaller regions. 

39. Redmon, J., et al. (2015). You Only Look Once: Unified, real-time object detection. arXiv: 1506.02640. 

40. Pretrained models are used in transfer learning, which we detail at chapter's end. 

object-classification probabilities are predicted. Bounding boxes with class probabilities 
above a threshold value are selected, and these combine to locate an object within an 
image. 

You can think of the YOLO method as aggregating many smaller bounding boxes, but 
only if they have a reasonably good probability of containing any given object class. The 
algorithm improved on the speed of Faster R-CNN, but it struggled to accurately detect 
small objects in an image. 
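
The threshold-based selection described above can be sketched as follows. This is a minimal illustration rather than the published YOLO post-processing: the cells, boxes, class probabilities, and the 0.5 threshold are all made-up values, and select_boxes is a hypothetical helper name. 

```python
# Toy version of the selection step: keep only confident boxes.
# All boxes, probabilities, and the 0.5 threshold are invented for illustration.

predictions = [
    {"cell": (0, 0), "box": (10, 12, 60, 80), "class_probs": {"dog": 0.08, "ball": 0.02}},
    {"cell": (0, 1), "box": (55, 20, 120, 90), "class_probs": {"dog": 0.91, "ball": 0.03}},
    {"cell": (1, 1), "box": (60, 25, 118, 95), "class_probs": {"dog": 0.74, "ball": 0.05}},
    {"cell": (2, 2), "box": (150, 160, 170, 180), "class_probs": {"dog": 0.01, "ball": 0.66}},
]

def select_boxes(predictions, threshold=0.5):
    """Return (class_name, probability, box) for boxes whose best class clears the threshold."""
    selected = []
    for prediction in predictions:
        probs = prediction["class_probs"]
        best_class = max(probs, key=probs.get)
        if probs[best_class] > threshold:
            selected.append((best_class, probs[best_class], prediction["box"]))
    return selected

detections = select_boxes(predictions)
for class_name, prob, box in detections:
    print(class_name, prob, box)
```

Here the two neighboring "dog" boxes both survive the threshold; combining such overlapping boxes into a single detection is a further step that this sketch leaves out. 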

Since the original YOLO paper, Redmon and his colleagues have released their 
YOLO9000 41 and YOLOv3 models. 42 YOLO9000 resulted in increases in both 
execution speed and model accuracy, and YOLOv3 yielded some speed for even further 
improved accuracy, in large part due to the increased sophistication of the underlying 
model architectures. The details of these continuations stretch beyond the scope of 
this book, but at the time of writing these models represent the cutting edge of 
object-detection algorithms. 

Image Segmentation 

When the visual field of a human is exposed to a real-world scene containing many 
overlapping visual elements, such as the game of association football (soccer) captured in 
Figure 10.15, the adult brain seems to effortlessly distinguish figures from the 
background, defining the boundaries of these figures and relationships between them 



Figure 10.15 This is an example of image segmentation (as performed by the Mask 
R-CNN algorithm). Whereas object detection involves defining object locations with 
coarse bounding boxes, image segmentation predicts the location of objects to the 
pixel level. 


41. Redmon, J., et al. (2016). YOLO9000: Better, faster, stronger. arXiv: 1612.08242. 

42. Redmon, J. (2018). YOLOv3: An incremental improvement. arXiv: 1804.02767. 

within a few hundred milliseconds. In this section, we cover image segmentation, another 
application area where deep learning has in a few short years bridged much of the gap 
in visual capability between humans and machines. We focus on two prominent model 
architectures—Mask R-CNN and U-Net—that are able to reliably classify objects in an 
image on a pixelwise scale. 

Mask R-CNN 

Mask R-CNN was developed by Facebook AI Research (FAIR) in 2017. 43 This 
approach involves: 

1. Using the existing Faster R-CNN architecture to propose ROIs within the image 
that are likely to contain objects. 

2. An ROI classifier predicting what kind of object exists in the bounding box while 
also refining the location and size of the bounding box. 

3. Using the bounding box to grab the parts of the feature maps from the underlying 
CNN that correspond to that part of the image. 

4. Feeding the feature maps for each ROI into a fully convolutional network that 
outputs a mask indicating which pixels correspond to the object in the image. An 
example of such a mask, consisting of bright colors to designate the pixels 
associated with separate objects, is provided in Figure 10.15. 

Image segmentation problems require binary masks as labels for training. These consist 
of arrays of the same dimensions as the original image. However, instead of RGB pixel 
values they contain 1s and 0s indicating where in the image the object is, with the 1s 
representing a given object's pixel-by-pixel location (and the 0s representing everywhere 
else). If an image contains a dozen different objects, then it must have a dozen binary 
masks. 
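
To make the label format concrete, here is a small sketch that builds one binary mask per object for a tiny 4 x 5 "image". The object names, pixel coordinates, and the make_binary_mask helper are invented for illustration; real masks would have the same height and width as the training image. 

```python
# Build one binary mask per object for a tiny 4 x 5 "image".
# The object names and pixel coordinates are invented for illustration.

def make_binary_mask(height, width, object_pixels):
    """Return a height x width grid of 0s with 1s at the given (row, col) positions."""
    mask = [[0] * width for _ in range(height)]
    for row, col in object_pixels:
        mask[row][col] = 1
    return mask

height, width = 4, 5
objects = {
    "ball": [(0, 0), (0, 1), (1, 0), (1, 1)],
    "player": [(1, 3), (2, 3), (3, 3), (3, 4)],
}

# One mask per object, each with the same dimensions as the image.
masks = {name: make_binary_mask(height, width, pixels)
         for name, pixels in objects.items()}

for name, mask in masks.items():
    print(name)
    for row in mask:
        print(row)
```

Two objects give two masks; an image with a dozen objects would need a dozen such grids. 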

U-Net 

Another popular image segmentation model is U-Net, which was developed at the 
University of Freiburg (and was mentioned at the end of Chapter 3 with respect to the 
automated photo-processing pipelines). 44 U-Net was created for the purpose of segmenting 
biomedical images, and at the time of writing it outperformed the best available methods 
in two challenges held by the International Symposium on Biomedical Imaging. 45 

The U-Net model consists of a fully convolutional architecture, which begins with 
a contracting path that produces successively smaller and deeper activation maps through 
multiple convolution and max-pooling steps. Subsequently, an expanding path restores 
these deep activation maps back to full resolution through multiple upsampling and 
convolution steps. These two paths (the contracting and expanding paths) are symmetrical 
(forming a “U” shape), and because of this symmetry the activation maps from the 
contracting path can be concatenated onto those of the expanding path. 


43. He, K., et al. (2017). Mask R-CNN. arXiv: 1703.06870. 

44. Ronneberger, O., et al. (2015). U-Net: Convolutional networks for biomedical image segmentation. arXiv: 
1505.04597. 

45. The two challenges were the segmentation of neuronal structures in electron microscopy stacks, and the ISBI 
cell-tracking challenge from 2015. 

The contracting path serves to allow the model to learn high-resolution features from 
the image. These high-res features are handed directly to the expanding path. By the end 
of the expanding path, we expect the model to have localized these features within the 
final image dimensions. After concatenating the feature maps from the contracting path 
onto the expanding path, a subsequent convolutional layer allows the network to learn to 
assemble and localize these features precisely. The final result is a network that is highly 
adept both at identifying features and at locating those features within two-dimensional 
space. 
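
The symmetry that makes concatenation possible can be checked with simple shape arithmetic. The sketch below traces (height, width, channels) through a toy U-shaped network. It assumes 'same' padding, 2 x 2 max-pooling, 2x upsampling, and channel counts that double on the way down; the published U-Net differs in its details, and the function names are hypothetical. 

```python
# Trace activation-map shapes (height, width, channels) through a toy U-shaped network.
# Assumptions: 'same' padding, 2 x 2 max-pooling, 2x upsampling, and channel counts
# that double on the way down. The original U-Net differs in its details.

def contracting_path(shape, depth):
    """Halve height/width and double channels at each step; keep each shape for later."""
    skips = []
    height, width, channels = shape
    for _ in range(depth):
        skips.append((height, width, channels))
        height, width, channels = height // 2, width // 2, channels * 2
    return (height, width, channels), skips

def expanding_path(shape, skips):
    """Upsample, then concatenate the matching contracting-path map along channels."""
    concat_shapes = []
    height, width, channels = shape
    for skip_h, skip_w, skip_c in reversed(skips):
        height, width, channels = height * 2, width * 2, channels // 2
        # Symmetry means the spatial sizes line up, so concatenation is possible.
        assert (height, width) == (skip_h, skip_w)
        concat_shapes.append((height, width, channels + skip_c))
        # A subsequent convolution is assumed to bring channels back to `channels`.
    return (height, width, channels), concat_shapes

bottom, skips = contracting_path((128, 128, 16), depth=3)
top, concat_shapes = expanding_path(bottom, skips)
print(bottom, top, concat_shapes)
```

The deepest map is small but has many channels, and the final map is back at the input resolution, which is the behavior the two paths are meant to provide. 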


Transfer Learning 

To be effective, many of the models we describe in this chapter are trained on very large 
datasets of diverse images. This training requires significant compute resources, and the 
datasets themselves are not cheap or easy to assemble. Over the course of this training, a 
given CNN learns to extract general features from the images. At a low level, these are 
lines, edges, colors, and simple shapes; at a higher level, they are textures, combinations 
of shapes, parts of objects, and other complex visual elements (recall Figure 1.17). If the 
CNN has been trained on a suitably varied set of images and if it is sufficiently deep, these 
feature maps likely contain a rich library of visual elements that can be assembled and 
combined to form nearly any image. For example, a feature map that identifies a 
dimpled texture combined with another that recognizes round objects and yet another that 
responds to white colors could be recombined to correctly identify a golf ball. Transfer 
learning takes advantage of this library of existing visual elements contained within the 
feature maps of a pretrained CNN and repurposes them to become specialized in identifying 
new classes of objects. 

Say, for example, that you’d like to build a machine vision model that performs the 
binary classification task we’ve addressed time and again since Chapter 6: distinguishing 
hot dogs from anything that is not a hot dog. Of course, you could design a large and 
complex CNN that takes in images of hot dogs and, well . . . not hot dogs, and outputs a 
single sigmoid class prediction. You could train this model on a large number of training 
images, and you'd expect the convolutional layers early in the network to learn a set of 
feature maps that will identify hot dog-esque features. Frankly, this would work pretty 
well. However, you’d need a lot of time and a lot of compute power to train the CNN 
properly, and you'd need a large number of diverse images so that the CNN could learn 
a suitably diverse set of feature maps. This is where transfer learning comes in: Instead 
of training a model from scratch, you can leverage the power of a deep model that has 
already been trained on a large set of images and quickly repurpose it to detecting hot 
dogs specifically. 

Earlier in this chapter, we mentioned VGGNet as an example of a classic machine 
vision model architecture. In Example 10.5 and in our VGGNet in Keras Jupyter 
notebook, we showcase the VGGNet16 model, which is composed of 16 layers of artificial 
neurons, mostly repeating conv-pool blocks (see Figure 10.10). The closely related 
VGGNet19 model, which incorporates one further conv-pool block (containing three 
convolutional layers), is our pick for our transfer-learning starting point. In our 
accompanying notebook, Transfer Learning in Keras, we load VGGNet19 and modify it for our own 
hot-doggy purposes. 

The chief advantage of VGG19 over VGG16 is that VGG19's additional layers afford 
it additional opportunities for the abstract representation of visual imagery. The chief 
disadvantage of VGG19 relative to VGG16 is that these additional layers mean more 
parameters and therefore a longer training time. Further, because of the vanishing 
gradient problem, backpropagation may struggle through VGG19’s additional early 
layers. 


To start, let’s get the standard imports out of the way and load up the pretrained 
VGGNet19 model (Example 10.6). 

Example 10.6 Loading the VGGNet19 model for transfer learning 

# Load dependencies: 
from keras.applications.vgg19 import VGG19 
from keras.models import Sequential 
from keras.layers import Dense, Dropout, Flatten 
from keras.preprocessing.image import ImageDataGenerator 

# Load the pre-trained VGG19 model: 
vgg19 = VGG19(include_top=False, 
              weights='imagenet', 
              input_shape=(224,224,3), 
              pooling=None) 

# Freeze all the layers in the base VGGNet19 model: 
for layer in vgg19.layers: 
    layer.trainable = False 

Handily, Keras provides the network architecture and parameters (called weights, but 
including biases, too) already, so loading the pretrained model is easy. 46 Arguments passed to 
the VGG19 function help to define some characteristics of the loaded model: 

■ include_top=False specifies that we do not want the final dense classification 
layers from the original VGGNet19 architecture. These layers were trained for 
classifying the original ImageNet data. Rather, as you’ll see momentarily, we’ll make 
our own top layers and train them ourselves using our own data. 

■ weights='imagenet' is to load model parameters trained on the 14 million-sample 
ImageNet dataset. 47 

■ input_shape=(224,224,3) initializes the model with the correct input image size 
to handle our hot dog data. 


46. For other pretrained Keras models, including the ResNet architecture we introduced earlier in this chapter, 
visit keras.io/applications. 

47. The only other weights argument option at the time of writing is None, which would be a random 
initialization, but in the future, model parameters trained on other datasets could be available. 


After we load the model, a quick for loop traverses each layer in the model and sets its 
trainable flag to False so that the parameters in these layers will not be updated during 
training. We are confident that the convolutional layers of VGGNet19 have been 
effectively trained to represent the generalized visual-imagery features of the large ImageNet 
dataset, so we leave the base model intact. 

In Example 10.7, we add fresh dense layers on top of the base VGGNet19 model. 
These layers take the features extracted from the input image by the pretrained 
convolutional layers, and through training they will learn to use these features to classify the 
images as hot dogs or not hot dogs. 

Example 10.7 Adding classification layers to transfer-learning model 

# Instantiate the sequential model and add the VGG19 model: 
model = Sequential() 
model.add(vgg19) 

# Add the custom layers atop the VGG19 model: 
model.add(Flatten(name='flattened')) 
model.add(Dropout(0.5, name='dropout')) 
model.add(Dense(2, activation='softmax', name='predictions')) 

# Compile the model for training: 
model.compile(optimizer='adam', loss='categorical_crossentropy', 
              metrics=['accuracy']) 
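
Because the base layers are frozen, only the new dense layer has parameters that training will update. A back-of-envelope count is sketched below. It assumes that the VGG19 base halves the height and width five times (through 2 x 2 max-pooling) and ends with 512 feature maps; these figures are our assumptions here, so treat the result as a sanity check rather than an authoritative number. 

```python
# Back-of-envelope count of the trainable parameters in the new classification head.
# Assumes the VGG19 base halves height and width five times (2 x 2 max-pooling)
# and ends with 512 feature maps.

input_height, input_width = 224, 224
n_pooling_stages = 5
n_final_maps = 512
n_classes = 2

feature_height = input_height // 2 ** n_pooling_stages
feature_width = input_width // 2 ** n_pooling_stages
flattened_size = feature_height * feature_width * n_final_maps

# Flatten and Dropout have no parameters; Dense has one weight per input per
# output neuron, plus one bias per output neuron.
dense_params = flattened_size * n_classes + n_classes

print(feature_height, feature_width, flattened_size, dense_params)
```

If Keras is available, calling model.summary() on the compiled model reports trainable and non-trainable parameter counts that can be compared against this estimate. 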

Next, we use an instance of the ImageDataGenerator class to load the data 
(Example 10.8). This class is provided by Keras and serves to load images on the fly. 
It’s especially helpful if you don’t want to load all of your training data into memory 
right away, or when you might want to perform random data augmentations in real time 
during training. 48 

Example 10.8 Defining data generators 

# Instantiate two image generator classes: 
train_datagen = ImageDataGenerator( 
    rescale=1.0/255, 
    data_format='channels_last', 
    rotation_range=30, 
    horizontal_flip=True, 
    fill_mode='reflect') 


48. In Chapter 9, we mention that data augmentation is an effective way to increase the size of a training dataset, 
thereby helping a model generalize to previously unseen data. 

valid_datagen = ImageDataGenerator( 
    rescale=1.0/255, 
    data_format='channels_last') 

# Define the batch size: 
batch_size=32 

# Define the train and validation data generators: 
train_generator = train_datagen.flow_from_directory( 
    directory='./hot-dog-not-hot-dog/train', 
    target_size=(224, 224), 
    classes=['hot_dog','not_hot_dog'], 
    class_mode='categorical', 
    batch_size=batch_size, 
    shuffle=True, 
    seed=42) 

valid_generator = valid_datagen.flow_from_directory( 
    directory='./hot-dog-not-hot-dog/test', 
    target_size=(224, 224), 
    classes=['hot_dog','not_hot_dog'], 
    class_mode='categorical', 
    batch_size=batch_size, 
    shuffle=True, 
    seed=42) 

The train-data generator will randomly rotate the images within a 30-degree range, 
randomly flip the images horizontally, rescale the data to between 0 and 1 (by multiplying by 
1/255), and load the image data into arrays in the “channels last” format. 49 The validation 
generator only needs to rescale and load the images; data augmentation would be of no 
value there. Finally, the flow_from_directory() method directs each generator to load 
the images from a directory we specify. 50 The remainder of the arguments to this method 
should be intuitive. 
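
Two of these transformations are simple enough to illustrate in plain Python. The sketch below is not how Keras implements them; the 2 x 3 grayscale "image" is invented, and rescale and horizontal_flip are hypothetical helper names chosen to mirror the generator arguments. 

```python
# Plain-Python illustration of two of the transformations described above.
# The 2 x 3 grayscale "image" is invented; Keras applies these steps to real image arrays.

image = [[0, 51, 255],
         [102, 204, 153]]

def rescale(image, factor=1.0 / 255):
    """Multiply every pixel by `factor`, mapping 0-255 values into the 0-1 range."""
    return [[pixel * factor for pixel in row] for row in image]

def horizontal_flip(image):
    """Mirror the image left to right by reversing each row."""
    return [row[::-1] for row in image]

rescaled = rescale(image)
flipped = horizontal_flip(image)
print(rescaled)
print(flipped)
```

During training the flip is applied at random to each image, which is why it belongs only in the train-data generator. 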

Now we’re ready to train (Example 10.9). Instead of using the fit() method as we 
did in all previous cases of model-fitting in this book, here we call the fit_generator() 
method on the model because we’ll be passing in a data generator in place of arrays of 
data. 51 During our run of this model, our best epoch turned out to be the sixth, in which 
we attained 81.2 percent accuracy. 


49. Look back at Example 10.6, and you’ll see that the model accepts inputs with dimensions of 224 × 224 × 3; 
that is, the channels dimension is last. The alternative is to set up the color channel as the first dimension. 

50. Instructions for downloading the data are included in our Jupyter notebook. 

51. As we warned earlier in this chapter, in the section on AlexNet and VGGNet, with very large models you may 
encounter out-of-memory errors. Please refer to bit.ly/DockerMem for information on increasing the amount 
of memory available to your Docker container. Alternatively, you could reduce your batch-size hyperparameter. 

Example 10.9 Train transfer-learning model 

model.fit_generator(train_generator, steps_per_epoch=15, 
                    epochs=16, validation_data=valid_generator, 
                    validation_steps=15) 
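
It can help to work out how many images these settings draw from the generator. The arithmetic below uses only the values from Examples 10.8 and 10.9 and gives upper bounds, since a generator may yield a smaller final batch when the dataset size is not a multiple of the batch size. 

```python
# How many images the training loop draws with the settings used above.
batch_size = 32        # from Example 10.8
steps_per_epoch = 15   # from Example 10.9
epochs = 16            # from Example 10.9

images_per_epoch = batch_size * steps_per_epoch
images_total = images_per_epoch * epochs

print(images_per_epoch, images_total)
```

In more recent Keras releases, fit() accepts generators directly and fit_generator() is deprecated, so the call may need adjusting depending on the installed version. 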


This demonstrates the power of transfer learning. With a small amount of training and 
almost no time spent on architectural considerations or hyperparameter tuning, we have 
at our fingertips a model that performs reasonably well on a rather complicated 
image-classification task: hot dog identification. With some time invested in hyperparameter 
tuning, the results could be improved further. 

Capsule Networks 

In 2017, Sara Sabour and her colleagues on Geoff Hinton's (Figure 1.16) Google Brain 
team in Toronto made a splash with a novel concept called capsule networks. 52 Capsule 
networks have received considerable interest, because they are able to take positional 
information into consideration. CNNs, to their great detriment, do not; so a CNN 
would, for example, consider both of the images in Figure 10.16 to be a human face. The 
theory behind capsule networks is beyond the scope of this book, but machine vision 
practitioners are generally aware of them so we wanted to be sure you were, too. Today, 
they are too computationally intensive to be predominant in applications, but cheaper 
compute and theoretical advancements could mean that this situation will change soon. 



Figure 10.16 With convolutional neural networks, which are agnostic to the relative 
positioning of image features, the figure on the left and the one on the right are equally 
likely to be classified as Geoff Hinton’s face. Capsule networks, in contrast, take positional 
information into consideration, and so would be less likely to mistake the right-hand 

figure for a face. 


52. Sabour, S., et al. (2017). Dynamic routing between capsules. arXiv: 1710.09829. 

Summary 


In this chapter, you learned about convolutional layers, which are specialized to detect 
spatial patterns, making them particularly useful for machine vision tasks. You 
incorporated these layers into a CNN inspired by the classic LeNet-5 architecture, enabling you 
to surpass the handwritten-digit recognition accuracy of the dense networks you designed 
in Part II. The chapter concluded by discussing best practices for building CNNs and 
surveying the most noteworthy applications of machine vision algorithms. In the coming 
chapter, you’ll discover that the spatial-pattern recognition capabilities of convolutional 
layers are well suited not only to machine vision but also to other tasks. 


Key Concepts 


Here are the essential foundational concepts thus far. New terms from the current 
chapter are highlighted in purple. 


■ parameters: 
    ■ weight w 
    ■ bias b 
■ activation a 
■ artificial neurons: 
    ■ sigmoid 
    ■ tanh 
    ■ ReLU 
    ■ linear 
■ input layer 
■ hidden layer 
■ output layer 
■ layer types: 
    ■ dense (fully connected) 
    ■ softmax 
    ■ convolutional 
    ■ max-pooling 
    ■ flatten 
■ cost (loss) functions: 
    ■ quadratic (mean squared error) 
    ■ cross-entropy 
■ forward propagation 
■ backpropagation 
■ unstable (especially vanishing) gradients 
■ Glorot weight initialization 
■ batch normalization 
■ dropout 
■ optimizers: 
    ■ stochastic gradient descent 
    ■ Adam 
■ optimizer hyperparameters: 
    ■ learning rate η 
    ■ batch size 






11 

Natural Language Processing 


In Chapter 2, we introduced computational representations of language, particularly 
highlighting word vectors as a potent approach for quantitatively capturing word 
meaning. In the present chapter, we cover code that will enable you to create your own word 
vectors as well as to provide them as an input into a deep learning model. 

The natural language processing models you build in this chapter will incorporate 
neural network layers we’ve applied already: dense layers from Chapters 5 through 9, and 
convolutional layers from Chapter 10. Our NLP models will also incorporate new layer 
types: ones from the family of recurrent neural networks. RNNs natively handle 
information that occurs in sequences such as natural language, but they can, in fact, handle 
any sequential data (such as financial time series or temperatures at a given geographic 
location), so they’re quite versatile. The chapter concludes with a section on deep 
learning networks that process data via multiple parallel streams, a concept that dramatically 
widens the scope for creativity when you design your model architectures and, as you’ll 
see, can also improve model accuracy. 

Preprocessing Natural Language Data 

There are steps you can take to preprocess natural language data such that the modeling 
you carry out downstream may be more accurate. Common natural language 
preprocessing options include: 

■ Tokenization: This is the splitting of a document (e.g., a book) into a list of discrete 
elements of language (e.g., words), which we call tokens. 

■ Converting all characters to lowercase: A capitalized word at the beginning of a 
sentence (e.g., She) has the same meaning as when it’s used later in a sentence (she). 
By converting all characters in a corpus to lowercase, we disregard any use of 
capitalization. 

■ Removing stop words: These are frequently occurring words that tend to contain 
relatively little distinctive meaning, such as the, at, which, and of. There is no 
universal consensus on the precise list of stop words, but depending on your application 
it may be sensible to ensure that certain words are (or aren’t!) considered to be stop 
words. For example, in this chapter, we’ll build a model to classify movie reviews as 
positive or negative. Some lists of stop words include negations like didn’t, isn’t, and 
wouldn’t that might be critical for our model to identify the sentiment of a movie 
review, so these words probably shouldn’t be removed. 

■ Removing punctuation: Punctuation marks generally don’t add much value to a 
natural language model and so are often removed. 

■ Stemming: 1 Stemming is the truncation of words down to their stem. For example, 
the words house and housing both have the stem hous. With smaller datasets in 
particular, stemming can be productive because it pools words with similar meanings 
into a single token. There will be more examples of this stemmed token’s 
context, enabling techniques like word2vec or GloVe to more accurately identify an 
appropriate location for the token in word-vector space (see Figures 2.5 and 2.6). 

■ Handling n-grams: Some words commonly co-occur in such a way that the 
combination of words is better suited to being considered a single concept than several 
separate concepts. As examples, New York is a bigram (an n-gram of length two), 
and New York City is a trigram (an n-gram of length three). When chained together, 
the words new, york, and city have a specific meaning that might be better captured 
by a single token (and therefore a single location in word-vector space) than three 
separate ones. 
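
The options listed above can be strung together in a deliberately naive, standard-library-only sketch. Everything here is an assumption made for illustration: STOP_WORDS, SUFFIXES, and BIGRAMS are tiny hypothetical lists, and naive_stem is a crude stand-in for a real stemmer such as Porter's, not a reimplementation of it. 

```python
import string

# A deliberately naive, standard-library-only sketch of the preprocessing steps above.
# STOP_WORDS, SUFFIXES, and BIGRAMS are tiny hypothetical lists, and `naive_stem`
# is a crude stand-in for a real stemmer such as Porter's.

STOP_WORDS = {"the", "at", "which", "of", "a", "in", "is"}
SUFFIXES = ("ing", "e", "s")
BIGRAMS = {("new", "york")}

def tokenize(text):
    """Split on whitespace (real tokenizers also handle punctuation and contractions)."""
    return text.split()

def strip_punctuation(tokens):
    """Remove punctuation characters and drop tokens that become empty."""
    table = str.maketrans("", "", string.punctuation)
    stripped = [token.translate(table) for token in tokens]
    return [token for token in stripped if token]

def naive_stem(token):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def join_bigrams(tokens):
    """Merge known word pairs into single tokens, e.g. new + york -> new_york."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in BIGRAMS:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def preprocess(text):
    """Tokenize, lowercase, strip punctuation, join bigrams, drop stop words, stem."""
    tokens = [token.lower() for token in tokenize(text)]
    tokens = join_bigrams(strip_punctuation(tokens))
    tokens = [token for token in tokens if token not in STOP_WORDS]
    return [naive_stem(token) for token in tokens]

print(preprocess("She is housing the House in New York, which is great."))
```

Note how house and housing end up as the same token, which is the pooling effect that stemming is meant to provide. 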

Depending on the particular task that we’ve designed our model for, as well as the 
dataset that we’re feeding into it, we may use all, some, or none of these data 
preprocessing steps. As you consider applying any preprocessing step to your particular problem, 
you can use your intuition to weigh whether it might ultimately be valuable to your 
downstream task. We’ve already mentioned some examples of this: 

■ Stemming may be helpful for a small corpus but unhelpful for a large one. 

■ Likewise, converting all characters to lowercase is likely to be helpful when you’re 
working with a small corpus, but, in a larger corpus that has many more examples 
of individual uses of words, the distinction of, say, general (an adjective meaning 
“widespread”) versus General (a noun meaning the commander of an army) may be 
valuable. 

■ Removing punctuation would not be an advantage in all cases. Consider, for 
example, if you were building a question-answering algorithm, which could use question 
marks to help it identify questions. 

■ Negations may be helpful as stop words for some classifiers but probably not for 
a sentiment classifier, for example. Which words you include in your list of stop 
words could be crucial to your particular application, so be careful with this one. In 
many instances, it will be best to remove only a limited number of stop words. 
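
The point about negations can be seen with two made-up reviews and two hypothetical stop-word lists, one that treats negations as stop words and one that does not. The lists and the remove_stop_words helper below are ours, invented for illustration only. 

```python
# Why negations matter for sentiment: two hypothetical stop-word lists applied
# to the same pair of made-up reviews.

STOP_WORDS_WITH_NEGATIONS = {"i", "the", "this", "didn't", "isn't", "wouldn't"}
STOP_WORDS_WITHOUT_NEGATIONS = {"i", "the", "this"}

def remove_stop_words(text, stop_words):
    """Lowercase, split on whitespace, and drop any token found in stop_words."""
    return [token for token in text.lower().split() if token not in stop_words]

positive = "I like this movie"
negative = "I didn't like this movie"

# With negations treated as stop words, the two reviews become indistinguishable.
collapsed_pos = remove_stop_words(positive, STOP_WORDS_WITH_NEGATIONS)
collapsed_neg = remove_stop_words(negative, STOP_WORDS_WITH_NEGATIONS)

# With negations kept, the difference in sentiment survives preprocessing.
kept_pos = remove_stop_words(positive, STOP_WORDS_WITHOUT_NEGATIONS)
kept_neg = remove_stop_words(negative, STOP_WORDS_WITHOUT_NEGATIONS)

print(collapsed_pos, collapsed_neg)
print(kept_pos, kept_neg)
```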

If you’re unsure whether a given preprocessing step may be helpful or not, you can 
investigate the situation empirically by incorporating the step and observing whether it 
impacts the accuracy of your deep learning model downstream. As a general rule, the 
larger a corpus becomes, the fewer preprocessing steps will be helpful. With a small 
corpus, you’re likely to be concerned about encountering words that are rare or that are 
outside the vocabulary of your training dataset. By pooling several rare words into a single 


1. Lemmatization, a more sophisticated alternative to stemming, requires the use of a reference vocabulary. For our 
purposes in this book, stemming is a sufficient approach for considering multiple related words as a single token. 

common token, you’ll be more likely to train a model effectively on the meaning of the 
group of related words. As the corpus becomes larger, however, rare words and 
out-of-vocabulary words become less and less of an issue. With a very large corpus, then, it is 
likely to be helpful to avoid pooling several words into a single common token. That’s 
because there will be enough instances of even the less-frequently-occurring words to 
effectively model their unique meaning as well as to model the relatively subtle nuances 
between related words (that might otherwise have been pooled together). 
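
The trade-off can be illustrated with a toy token list. In the sketch below, lowercasing stands in for any pooling step; the "corpus" is invented and far smaller than anything you would train on, and rare_tokens is a hypothetical helper. 

```python
from collections import Counter

# Toy illustration of pooling: lowercasing merges tokens, so fewer tokens are rare.
# The "corpus" is a made-up token list; a real one would have millions of tokens.

corpus = ["She", "said", "the", "General", "gave", "general", "advice",
          "and", "she", "said", "The", "general", "advice", "helped"]

def rare_tokens(tokens, max_count=1):
    """Return the sorted tokens that occur at most max_count times."""
    counts = Counter(tokens)
    return sorted(token for token, count in counts.items() if count <= max_count)

rare_before = rare_tokens(corpus)
rare_after = rare_tokens([token.lower() for token in corpus])

print(len(set(corpus)), rare_before)
print(len({token.lower() for token in corpus}), rare_after)
```

Pooling shrinks the vocabulary and the number of rare tokens, but it also erases the distinction between General and general, which is the cost that a larger corpus lets you avoid paying. 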

To provide practical examples of these preprocessing steps in action, we invite you 
to check out our Natural Language Preprocessing Jupyter notebook. It begins by loading a 
number of dependencies: 

import nltk 
from nltk import word_tokenize, sent_tokenize 
from nltk.corpus import stopwords 
from nltk.stem.porter import * 
nltk.download('gutenberg') 
nltk.download('punkt') 
nltk.download('stopwords') 

import string 

import gensim 
from gensim.models.phrases import Phraser, Phrases 
from gensim.models.word2vec import Word2Vec 

from sklearn.manifold import TSNE 

import pandas as pd 
from bokeh.io import output_notebook, output_file 
from bokeh.plotting import show, figure 
%matplotlib inline 

Most of these dependencies are from nltk (the Natural Language Toolkit) and gensim 
(another natural language library for Python). We explain our use of each individual 
dependency when we apply it in the example code that follows. 

Tokenization 

The dataset we used in this notebook is a small corpus of out-of-copyright books from 
Project Gutenberg. 2 This corpus is available within nltk so it can be easily loaded using this 
code: 

from nltk.corpus import gutenberg 


2. Named after the printing-press inventor Johannes Gutenberg, Project Gutenberg is a source of tens of thousands 
of electronic books. These books are classic works of literature from across the globe whose copyright has now 
expired, making them freely available. See gutenberg.org. 

This wee corpus consists of a mere 18 literary works, including Jane Austen’s Emma, 
Lewis Carroll’s Alice in Wonderland, and three plays by a little-known fellow named 
William Shakespeare. (Execute gutenberg.fileids() to print the names of all 18 
documents.) By running len(gutenberg.words()), you can see that the corpus comes 
out to 2.6 million words, a manageable quantity that means you’ll be able to run all of 
the code examples in this section on a laptop. 

To tokenize the corpus into a list of sentences, one option is to use nltk’s 
sent_tokenize() method: 

gberg_sent_tokens = sent_tokenize(gutenberg.raw()) 

Accessing the first element of the resulting list by running gberg_sent_tokens[0], you 
can see that the first book in the Project Gutenberg corpus is Emma, because this first 
element contains the book’s title page, chapter markers, and first sentence, all 
(erroneously) blended together with newline characters (\n): 

'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, 
handsome, clever, and rich, with a comfortable home\nand happy 
disposition, seemed to unite some of the best blessings\nof existence; 
and had lived nearly twenty-one years in the world\nwith very little to 
distress or vex her.' 

A stand-alone sentence is found in the second element, which you can view by 
executing gberg_sent_tokens[1]: 

"She was the youngest of the two daughters of a most affectionate, 
\nindulgent father; and had, in consequence of her sister's 
marriage,\nbeen mistress of his house from a very early period." 

You can further tokenize this sentence down to the word level using nltk's 
word_tokenize() method: 

word_tokenize(gberg_sent_tokens[1]) 

This prints a list of words with all whitespace, including newline characters, stripped out 
(see Figure 11.1). The word father, for example, is the 15th word in the second sentence, 
as you can see by running this line of code: 

word_tokenize(gberg_sent_tokens[1])[14] 
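
To make the two levels of tokenization concrete without any library installed, here is a minimal sketch using only regular expressions. The function names naive_sent_tokenize and naive_word_tokenize are our own, and the splitting rules are deliberately crude assumptions; nltk's tokenizers handle abbreviations, quotations, and contractions far more carefully.

```python
import re

def naive_sent_tokenize(text):
    """Split after '.', '!' or '?' when followed by whitespace (a rough heuristic)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def naive_word_tokenize(sentence):
    """Treat runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

# A shortened, illustrative passage (not the full sentence from Emma)
text = ("She was the youngest of the two daughters of a most affectionate,\n"
        "indulgent father. Her mother had died too long ago.")

sentences = naive_sent_tokenize(text)
tokens = naive_word_tokenize(sentences[0])

print(len(sentences))  # number of sentences found
print(tokens[14])      # the 15th token of the first sentence
```

Because whitespace is never matched by the word-level pattern, the newline character inside the first sentence disappears from the token list, which mirrors the behavior described above.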

Although the sent_tokenize() and word_tokenize() methods may come in handy 
for working with your own natural language data, with this Project Gutenberg corpus, 
you can instead conveniently employ its built-in sents() method to achieve the same 
aims in a single step: 

gberg_sents = gutenberg.sents() 

This command produces gberg_sents, a tokenized list of lists. The higher-level list 
consists of individual sentences, and each sentence contains a lower-level list of words within 
it. 

['She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 
 'most', 'affectionate', ',', 'indulgent', 'father', ';', 'and', 'had', ',', 
 'in', 'consequence', 'of', 'her', 'sister', "'s", 'marriage', ',', 'been', 
 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period', '.'] 

Figure 11.1 The second sentence of Jane Austen's classic Emma tokenized to the word level 

Appropriately, the sents() method also separates the title page and chapter markers 
into their own individual elements, as you can observe with a call to gberg_sents[0:3]: 

[['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'], 
 ['VOLUME', 'I'], 
 ['CHAPTER', 'I']] 

Because of this, the first actual sentence of Emma is now on its own as the fourth element 
of gberg_sents, and so to access the 15th word (father) in the second actual sentence, we 
now use gberg_sents[4][14]. 

Converting All Characters to Lowercase 

For the remaining natural language preprocessing steps, we begin by applying them 
iteratively to a single sentence. As we wrap up the section later on, we'll apply the steps across 
the entire 18-document corpus. 

Looking back at Figure 11.1, we see that this sentence begins with the capitalized 
word She. If we'd like to disregard capitalization so that this word is considered to be 
identical to she, then we can use the lower() method of Python strings, as 
shown in Example 11.1. 

Example 11.1 Converting a sentence to lowercase 

[w.lower() for w in gberg_sents[4]] 

This line returns the same list as in Figure 11.1 with the exception that the first element 
in the list is now she instead of She. 

Removing Stop Words and Punctuation 

Another potential inconvenience with the sentence in Figure 11.1 is that it's littered with 
both stop words and punctuation. To handle these, let's use the + operator to concatenate 
nltk's list of English stop words with the string library's list of punctuation 
marks: 

stpwrds = stopwords.words('english') + list(string.punctuation) 

If you examine the stpwrds list that you've created, you'll see that it contains many 
common words that often don't carry much particular meaning, such as a, an, and the. 3 
However, it also contains words like not and other negative words that could be critical if 
we were building a sentiment classifier, such as in the sentence, “This film was not good.” 

In any event, to remove all of the elements of stpwrds from a sentence we could use a 
list comprehension 4 as we do in Example 11.2, which incorporates the lowercasing we used 
in Example 11.1. 

Example 11.2 Removing stop words and punctuation with a list comprehension 

[w.lower() for w in gberg_sents[4] if w.lower() not in stpwrds] 

Relative to Figure 11.1, running this line of code returns a much shorter list that now 
contains only words that each tend to convey a fair bit of meaning: 

['youngest', 'two', 'daughters', 'affectionate', 'indulgent', 'father', 
 'consequence', 'sister', 'marriage', 'mistress', 'house', 'early', 'period'] 


3. These three particular words are called articles, or determiners. 

4. See bit.ly/listComp if you'd like an introduction to list comprehensions in Python. 
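
The caution about negation words can be seen in a small, self-contained sketch. Here toy_stop_words is a hand-picked set that we assume purely for illustration (nltk's English list is much longer); the point is only that a stop list containing not changes what the filtered sentence appears to say.

```python
import string

# Hypothetical, hand-picked stop words for illustration only
toy_stop_words = {'this', 'was', 'a', 'an', 'the', 'not'}
toy_stpwrds = toy_stop_words | set(string.punctuation)

tokens = ['This', 'film', 'was', 'not', 'good', '.']

# Same pattern as Example 11.2: lowercase, then drop stop words and punctuation
filtered = [w.lower() for w in tokens if w.lower() not in toy_stpwrds]

# Dropping 'not' from the stop list keeps the negation in the output
kept_negation = [w.lower() for w in tokens
                 if w.lower() not in (toy_stpwrds - {'not'})]

print(filtered)
print(kept_negation)
```

With not removed, the review is reduced to film and good, which a sentiment classifier could easily read as positive.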

Stemming 

To stem words, you can use the Porter algorithm 5 provided by nltk. To do this, you create 
an instance of a PorterStemmer() object and then add its stem() method to the list 
comprehension you began in Example 11.2, as shown in Example 11.3. 

Example 11.3 Adding word stemming to our list comprehension 

stemmer = PorterStemmer()  # the stemmer instance described in the text 
[stemmer.stem(w.lower()) for w in gberg_sents[4] 
 if w.lower() not in stpwrds] 

This outputs the following: 

['youngest', 'two', 'daughter', 'affection', 'indulg', 'father', 
 'consequ', 'sister', 'marriag', 'mistress', 'hous', 'earli', 'period'] 

This is similar to our previous output of the sentence except that many of the words have 
been stemmed: 

1. daughters to daughter (allowing the plural and singular terms to be treated identically) 

2. house to hous (allowing related words like house and housing to be treated as the same) 

3. early to earli (allowing differing tenses such as early, earlier, and earliest to be treated as 
the same) 

These stemming examples may be advantageous with a corpus as small as ours, because 
there are relatively few examples of any given word. By pooling similar words together, 
we obtain more occurrences of the pooled version, and so it may be assigned to a more 
accurate location in vector space (Figure 2.6). With a very large corpus, however, where 
you have many more examples of rarer words, there might be an advantage to treating 
plural and singular variations on a word differently, treating related words as unique, and 
retaining multiple tenses; the nuances could prove to convey valuable meaning. 
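
The pooling effect described above can be sketched with a simple counter. The toy_stems lookup is a hypothetical stand-in for a real stemmer's output (its entries follow the examples in the list above), and the token list is invented for illustration.

```python
from collections import Counter

# Hypothetical stem lookup standing in for a real stemmer such as Porter's
toy_stems = {
    'daughter': 'daughter', 'daughters': 'daughter',
    'house': 'hous', 'housing': 'hous',
}

# An invented handful of tokens, as if drawn from a small corpus
tokens = ['daughter', 'daughters', 'house', 'housing', 'daughters']

raw_counts = Counter(tokens)                           # each surface form counted separately
pooled_counts = Counter(toy_stems[w] for w in tokens)  # forms pooled by stem

print(raw_counts)
print(pooled_counts)
```

Each surface form occurs only once or twice on its own, whereas the pooled stems have more occurrences to learn from, which is the advantage for a small corpus.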


5. Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130–7. 

Handling n-grams 

To treat a bigram like New York as a single token instead of two, we can use the 
Phrases() and Phraser() methods from the gensim library. As demonstrated in 
Example 11.4, we use them in this way: 

1. Phrases() to train a “detector” to identify how often any given pair of words 
occurs together in our corpus (the technical term for this is bigram collocation) relative 
to how often each word in the pair occurs by itself 

2. Phraser() to take the bigram collocations detected by the Phrases() object and 
then use this information to create an object that can efficiently be passed over our 
corpus, converting all bigram collocations from two consecutive tokens into a single 
token 

Example 11.4 Detecting collocated bigrams 

phrases = Phrases(gberg_sents) 
bigram = Phraser(phrases) 


By running bigram.phrasegrams, we output a dictionary of the count and score of each 
bigram. The topmost lines of this dictionary are provided in Figure 11.2. 

Each bigram in Figure 11.2 has a count and a score associated with it. The bigram two 
daughters, for example, occurs a mere 19 times across our Gutenberg corpus. This bigram 
has a fairly low score (12.0), meaning the terms two and daughters do not occur together 
very frequently relative to how often they occur apart. In contrast, the bigram Miss Taylor 
occurs more often (48 times), and the terms Miss and Taylor occur much more frequently 
together relative to how often they occur on their own (score of 453.8). 
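
To give a feel for how a score of this kind behaves, here is a minimal sketch of a collocation score in the style of Mikolov and colleagues' phrase-detection heuristic: co-occurrences (discounted by a minimum count) divided by the product of the two words' individual counts, scaled by vocabulary size. We assume this form for illustration; the function name collocation_score and all of the counts below are hypothetical, and the gensim documentation should be consulted for the exact scorer your version uses.

```python
def collocation_score(count_a, count_b, count_ab, vocab_size, min_count=5):
    """Score a word pair: discounted co-occurrence count relative to the
    product of the individual word counts, scaled by vocabulary size."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Hypothetical counts, chosen only to illustrate the contrast
common_pair = collocation_score(count_a=1000, count_b=500,
                                count_ab=20, vocab_size=40000)
name_pair = collocation_score(count_a=400, count_b=60,
                              count_ab=48, vocab_size=40000)

print(common_pair)  # two frequent words that rarely sit together
print(name_pair)    # two rarer words that usually sit together
```

A pair built from very common words scores low even if it appears a fair number of times, whereas a pair whose members mostly appear together scores high.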


{(b'two', b'daughters'): (19, 11.966813731181546), 
 (b'her', b'sister'): (195, 17.7960829227865), 
 (b"'", b's'): (9781, 31.066242737744524), 
 (b'very', b'early'): (24, 11.01214147275924), 
 (b'Her', b'mother'): (14, 13.529425062715127), 
 (b'long', b'ago'): (38, 63.22343628984788), 
 (b'more', b'than'): (541, 29.023584433996874), 
 (b'had', b'been'): (1256, 22.306024648925288), 
 (b'an', b'excellent'): (54, 39.063874851750626), 
 (b'Miss', b'Taylor'): (48, 453.75918026073305), 
 (b'very', b'fond'): (28, 24.134280468850747), 
 (b'passed', b'away'): (25, 12.35053642325912), 
 (b'too', b'much'): (173, 31.376002029426687), 
 (b'did', b'not'): (935, 11.728416217142811), 
 (b'any', b'means'): (27, 14.096964108090186), 
 (b'wedding', b'-'): (15, 17.4695197740113), 
 (b'Her', b'father'): (18, 13.129571562488772), 
 (b'after', b'dinner'): (21, 21.5285481168817), 

Figure 11.2 A dictionary of bigrams detected within our corpus 

Scanning over the bigrams in Figure 11.2, notice that they are marred by capitalized 
words and punctuation marks. We'll resolve those issues in the next section, but in the 
meantime let's explore how the bigram object we've created can be used to convert 
bigrams from two consecutive tokens into one. Let's tokenize a short sentence by using 
the split() method on a string of characters wherever there's a space, as follows: 

tokenized_sentence = "Jon lives in New York City".split() 

If we print tokenized_sentence, we output a list of unigrams only: ['Jon', 'lives', 
'in', 'New', 'York', 'City']. If, however, we pass the list through our gensim 
bigram object by using bigram[tokenized_sentence], the list then contains the bigram 
New York: ['Jon', 'lives', 'in', 'New_York', 'City']. 


After you've identified bigrams across your corpus by running it through the bigram 
object, you can detect trigrams (such as New York City) by passing this new, 
bigram-filled corpus through the Phrases() and Phraser() methods. This could be repeated 
again to identify 4-grams (and then again to identify 5-grams, and so on); however, 
there are diminishing returns from this. Bigrams (or at most trigrams) should suffice 
for the majority of applications. By the way, if you go ahead and detect trigrams with 
the Project Gutenberg corpus, New York City is unlikely to be detected. Our corpus 
of classic literature doesn't mention it often enough. 
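
The mechanics of converting detected pairs into single tokens, and of repeating the pass to obtain trigrams, can be sketched in plain Python. The merge_bigrams function and the two hand-written sets of "detected" pairs are hypothetical stand-ins for a trained gensim object; they are assumptions for illustration, not gensim's implementation.

```python
def merge_bigrams(tokens, bigrams, delimiter='_'):
    """Scan left to right, joining any adjacent pair found in `bigrams`."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in bigrams:
            merged.append(delimiter.join(pair))
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = "Jon lives in New York City".split()

# First pass: pretend ('New', 'York') was detected as a collocation
first_pass = merge_bigrams(tokens, {('New', 'York')})

# Second pass over the bigram-filled tokens: pairs can now include merged tokens
second_pass = merge_bigrams(first_pass, {('New_York', 'City')})

print(first_pass)
print(second_pass)
```

The second pass treats New_York as an ordinary token, so pairing it with City yields what is effectively a trigram.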



Preprocessing the Full Corpus 

Having run through some examples of preprocessing steps on individual sentences, we 
now compose some code to preprocess the entire Project Gutenberg corpus. This will 
also enable us to collocate bigrams on a cleaned-up corpus that no longer contains capital 
letters or punctuation. 

Later on in this chapter, we'll use a corpus of film reviews that was curated by Andrew 
Maas and his colleagues at Stanford University to predict the sentiment of the reviews 
with NLP models. 6 During their data preprocessing steps, Maas and his coworkers 
decided to leave in stop words because they are “indicative of sentiment.” 7 They also 
decided not to stem words because they felt their corpus was sufficiently large that their 
word-vector-based NLP model “learns similar representations of words of the same stem 
when the data suggest it.” Said another way, words that have a similar meaning should 
find their way to a similar location in word-vector space (Figure 2.6) during model 
training. 

Following their lead, we'll also forgo stop-word removal and stemming when 
preprocessing the Project Gutenberg corpus, as in Example 11.5. 


6. Maas, A., et al. (2011). Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the 
Association for Computational Linguistics, 142—50. 

7. This is in line with our thinking, as we mentioned earlier in the chapter. 

Example 11.5 Removing capitalization and punctuation from Project Gutenberg corpus 

lower_sents = [] 
for s in gberg_sents: 
    lower_sents.append([w.lower() for w in s 
                        if w.lower() not in list(string.punctuation)]) 

In this example, we begin with an empty list we call lower_sents, and then we append 
preprocessed sentences to it using a for loop. 8 For preprocessing each sentence within 
the loop, we used a variation on the list comprehension from Example 11.2, in this case 
removing only punctuation marks while converting all characters to lowercase. 
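
As a sketch of the same step written as a reusable function, the version below builds the punctuation lookup once as a set and applies the function with map(). The names lower_and_strip, punctuation_marks, and toy_sents are our own, and the two short "sentences" stand in for the full corpus.

```python
import string

# Built once, outside the loop; set membership tests are fast
punctuation_marks = set(string.punctuation)

def lower_and_strip(sentence):
    """Lowercase every token and drop single-character punctuation tokens."""
    return [w.lower() for w in sentence if w not in punctuation_marks]

# Stand-in for gberg_sents: a list of tokenized sentences
toy_sents = [['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']'],
             ['VOLUME', 'I']]

toy_lower = list(map(lower_and_strip, toy_sents))
print(toy_lower)
```

Because each sentence is processed independently by a plain function, the same function could later be handed to a parallel map if the corpus were large.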

With punctuation and capitals removed, we can set about detecting collocated bigrams 
across the corpus afresh: 

lower_bigram = Phraser(Phrases(lower_sents)) 

Relative to Example 11.4, this time we created our gensim lower_bigram object in a 
single line by chaining the Phrases() and Phraser() methods together. The top of the 
output of a call to lower_bigram.phrasegrams is provided in Figure 11.3. Comparing 
these bigrams with those from Figure 11.2, we do indeed observe that they are all in 
lowercase (e.g., miss taylor) and bigrams that included punctuation marks are nowhere to 
be seen. 


{(b'two', b'daughters'): (19, 11.080802900992637), 
 (b'her', b'sister'): (201, 16.93971298099339), 
 (b'very', b'early'): (25, 10.516998773665177), 
 (b'her', b'mother'): (253, 10.70812618607742), 
 (b'long', b'ago'): (38, 59.226442015336005), 
 (b'more', b'than'): (562, 28.529926612065935), 
 (b'had', b'been'): (1260, 21.583193129694834), 
 (b'an', b'excellent'): (58, 37.41859680854167), 
 (b'sixteen', b'years'): (15, 131.42913000977515), 
 (b'miss', b'taylor'): (48, 420.4340982546865), 
 (b'mr', b'woodhouse'): (132, 104.19907841850323), 
 (b'very', b'fond'): (30, 24.185726346489627), 
 (b'passed', b'away'): (25, 11.751473221742694), 
 (b'too', b'much'): (177, 30.36309017383541), 
 (b'did', b'not'): (977, 10.846196223896685), 
 (b'any', b'means'): (28, 14.294148100212627), 
 (b'after', b'dinner'): (22, 18.60737125272944), 
 (b'mr', b'weston'): (162, 91.63290824201266), 

Figure 11.3 Sample of a dictionary of bigrams detected within our lowercased and 
punctuation-free corpus 



8. If you're preprocessing a large corpus, we'd recommend using optimizable and parallelizable functional 
programming techniques in place of our simple (and therefore simple-to-follow) for loop. 

Examining the results in Figure 11.3 further, however, it appears that the default 
minimum thresholds for both count and score are far too liberal. That is, word pairs like 
two daughters and her sister should not be considered bigrams. To attain bigrams that we 
thought were more sensible, we experimented with more conservative count and score 
thresholds by increasing them by powers of 2. Following this approach, we were generally 
satisfied by setting the optional Phrases() arguments to a min(imum) count of 32 and to a 
score threshold of 64, as shown in Example 11.6. 

Example 11.6 Detecting collocated bigrams with more conservative thresholds 

lower_bigram = Phraser(Phrases(lower_sents, 
                               min_count=32, threshold=64)) 


Although it's not perfect, 9 because there are still a few questionable bigrams like great deal 
and few minutes, the output from a call to lower_bigram.phrasegrams is now largely 
defensible, as shown in Figure 11.4. 
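
You may notice that the same bigram receives a different score in Figure 11.4 than in Figure 11.3 (miss taylor drops from about 420.4 to about 156.4). This is consistent with a score in which the pair's count is discounted by min_count before being divided by the individual word counts. Assuming that form, and assuming the first run used the Phrases() default min_count of 5, the change can be checked with simple arithmetic. The function name rescale_score is our own.

```python
DEFAULT_MIN_COUNT = 5  # assumed default min_count of gensim's Phrases()

def rescale_score(score, pair_count, old_min_count, new_min_count):
    """Rescale a score of the form (pair_count - min_count) * constant
    when only min_count changes."""
    return score * (pair_count - new_min_count) / (pair_count - old_min_count)

# Counts and scores copied from Figure 11.3
miss_taylor = rescale_score(420.4340982546865, 48, DEFAULT_MIN_COUNT, 32)
mr_woodhouse = rescale_score(104.19907841850323, 132, DEFAULT_MIN_COUNT, 32)

print(round(miss_taylor, 2), round(mr_woodhouse, 2))  # approximately 156.44 and 82.05
```

Both values agree with the scores shown for these bigrams in Figure 11.4, so raising min_count both filters out rare pairs and lowers every remaining score, which is worth bearing in mind when choosing the threshold alongside it.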

Armed with our well-appointed lower_bigram object from Example 11.6, we can at 
last use a for loop to iteratively append for ourselves a corpus of cleaned-up sentences, as 
in Example 11.7. 

Example 11.7 Creating a “clean” corpus that includes bigrams 

clean_sents = [] 
for s in lower_sents: 
    clean_sents.append(lower_bigram[s]) 


{(b'miss', b'taylor'): (48, 156.44059469941823), 
 (b'mr', b'woodhouse'): (132, 82.04651843976633), 
 (b'mr', b'weston'): (162, 75.87438262077481), 
 (b'mrs', b'weston'): (249, 160.68485093258923), 
 (b'great', b'deal'): (182, 93.36368125424357), 
 (b'mr', b'knightley'): (277, 161.74131790625913), 
 (b'miss', b'woodhouse'): (173, 229.03802722366902), 
 (b'years', b'ago'): (56, 74.31594785893046), 
 (b'mr', b'elton'): (214, 121.3990121932397), 
 (b'dare', b'say'): (115, 89.94000515807346), 
 (b'frank', b'churchill'): (151, 1316.4456593286038), 
 (b'miss', b'bates'): (113, 276.39588291692513), 
 (b'drawing', b'room'): (49, 84.91494947493561), 
 (b'mrs', b'goddard'): (58, 143.57843432545658), 
 (b'miss', b'smith'): (58, 73.03442128232508), 
 (b'few', b'minutes'): (86, 204.16834974753786), 
 (b'john', b'knightley'): (58, 83.03755747111268), 
 (b'don', b't'): (830, 250.30957446808512), 

Figure 11.4 Sample of a more conservatively thresholded dictionary of bigrams 


9. These are statistical approximations, of course! 

['sixteen', 'years', 'had', 'miss_taylor', 'been', 'in', 'mr_woodhouse', 's', 
 'family', 'less', 'as', 'a', 'governess', 'than', 'a', 'friend', 'very', 
 'fond', 'of', 'both', 'daughters', 'but', 'particularly', 'of', 'emma'] 

Figure 11.5 Clean, preprocessed sentence from the Project Gutenberg corpus 


As an example, Figure 11.5 shows the seventh element of our clean corpus 
(clean_sents[6]), a sentence that includes the bigrams miss taylor and mr woodhouse. 


Creating Word Embeddings with word2vec 

With the cleaned corpus of natural language clean_sents now available to us, we are 
well positioned to embed words from the corpus into word-vector space (Figure 2.6). As 
you'll see in this section, such word embeddings can be produced with a single line of 
code. This single line of code, however, should not be executed blindly, and it has quite a 
few optional arguments to consider carefully. Given this, we'll cover the essential theory 
behind word vectors before delving into example code. 

The Essential Theory Behind word2vec 

In Chapter 2, we provided an intuitive understanding of what word vectors are. We also 
discussed the underlying idea that because you can “know a word by the company it 
keeps” then a given word's meaning can be well represented as the average of the words 
that tend to occur around it. word2vec is an unsupervised learning technique 10 (that is, 
it is applied to a corpus of natural language without making use of any labels that may or 
may not happen to exist for the corpus). This means that any dataset of natural language 
could be appropriate as an input to word2vec. 11 


10. See Chapter 4 for a recap of the differences between the supervised, unsupervised, and reinforcement learning 
problems. 

When running word2vec, you can choose between two underlying model 
architectures, skip-gram (SG) or continuous bag of words (CBOW; pronounced see-bo), 
either of which will typically produce roughly comparable results despite maximizing 
probabilities from “opposite” perspectives. To make sense of this, reconsider our toy-sized 
corpus from Figure 2.5: 

you shall know a word by the company it keeps 

In it, we are considering word to be the target word, and the three words to the right of 
it as well as the three words to the left of it are considered to be context words. (This 
corresponds to a window size of three words, one of the primary hyperparameters we must 
take into account when applying word2vec.) With the SG architecture, context words 
are predicted given the target word. 12 With CBOW, it is the inverse: The target word is 
predicted based on the context words. 13 
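
The target and context words for the toy corpus can be picked out with a few lines of Python. The function name context_window and the way the training examples are laid out are our own illustrative choices; they sketch the idea of the two architectures rather than word2vec's internals.

```python
def context_window(tokens, target_index, window):
    """Return up to `window` words on each side of the target word."""
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right

corpus = "you shall know a word by the company it keeps".split()
target_index = corpus.index('word')
target = corpus[target_index]

context = context_window(corpus, target_index, window=3)

# Skip-gram view: one (input, output) example per context word
sg_examples = [(target, c) for c in context]

# CBOW view: all context words together predict the single target word
cbow_example = (context, target)

print(context)
print(sg_examples[0])
print(cbow_example)
```

With a window size of three, the target word has six context words in total; near the start or end of a corpus the window is simply truncated.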

To understand word2vec more concretely, let's focus on the CBOW architecture in 
greater detail (although we equally could have focused on SG instead). With CBOW, 
the target word is predicted to be the average of all the context words considered jointly. 
“Jointly” means “all at once”: The particular position of context words isn't taken into 
consideration, nor whether the context word occurs before or after the target word. That 
the CBOW architecture has this attribute is right there in the “bag of words” part of its 
name: 

■ We take all the context words within the windows to the right and the left of the 
target word. 

■ We (figuratively!) throw all of these context words into a bag. If it helps you 
remember that the sequence of words is irrelevant, you can even imagine 
shaking up the bag. 

■ We calculate the average of all the context words contained in the bag, using this 
average to estimate what the target word could be. 
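
The three steps above can be sketched with made-up, two-dimensional vectors. The values in toy_vectors are hypothetical (real word vectors are learned and have many more dimensions), and bag_average is our own name for the averaging step.

```python
# Hypothetical two-dimensional vectors for four context words
toy_vectors = {
    'know': [1.0, 0.0],
    'a':    [0.0, 1.0],
    'by':   [1.0, 1.0],
    'the':  [0.0, 0.0],
}

def bag_average(context_words, vectors):
    """Average the vectors of the context words, dimension by dimension."""
    n_dims = len(next(iter(vectors.values())))
    return [sum(vectors[w][d] for w in context_words) / len(context_words)
            for d in range(n_dims)]

context = ['know', 'a', 'by', 'the']

in_order = bag_average(context, toy_vectors)
shaken = bag_average(list(reversed(context)), toy_vectors)  # "shake up the bag"

print(in_order)
print(shaken)
```

Reordering the context words leaves the average unchanged, which is exactly the sense in which CBOW ignores word position.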


If we were concerned about syntax, the grammar of language (see Figure 2.9 for a 
refresher on the elements of natural language), then word order would matter. But 
because with word2vec we're concerned only with semantics, the meaning of words, 
it turns out that the order of context words is, on average, irrelevant. 


Having considered the intuitiveness of the “BOW” component of the CBOW 
moniker, let's also consider the “continuous” part of it: The target word and context 
word windows slide continuously one word at a time from the first word of the corpus all 
the way through to the final word. At each position along the way, the target word is 
estimated given the context words. Via stochastic gradient descent, the location of words 
within vector space can be shifted, and thereby these target-word estimates can gradually 
be improved. 


11. Mikolov, T., et al. (2013). Efficient estimation of word representations in vector space. arXiv:1301.3781. 

12. In more technical machine learning terms, the cost function of the skip-gram architecture is to maximize the 
log probability of any possible context word from a corpus given the current target word. 

13. Again, in technical ML jargon, the cost function for CBOW is maximizing the log probability of any possible 
target word from a corpus given the current context words. 

Table 11.1 Comparison of word2vec architectures 

Architecture     Predicts                          Relative Strengths 
Skip-gram (SG)   Context words given target word   Better for a smaller corpus; represents rare words well 
CBOW             Target word given context words   Multiple times faster; represents frequent words slightly better 

In practice, and as summarized in Table 11.1, the SG architecture is a better choice 
when you're working with a small corpus. It represents rare words in word-vector space 
well. In contrast, CBOW is much more computationally efficient, so it is the better 
option when you're working with a very large corpus. Relative to SG, CBOW also 
represents frequently occurring words slightly better. 14 


Although word2vec is comfortably the most widely used approach for embedding 
words from a corpus of natural language into vector space, it is by no means the 
only approach. A major alternative to word2vec is GloVe (global vectors for word 
representation), which was introduced by the prominent natural language researchers 
Jeffrey Pennington, Richard Socher, and Christopher Manning. 15 At the time (2014), 
the three were colleagues working together at Stanford University. 

GloVe and word2vec differ in their underlying methodology: word2vec uses 
predictive models, while GloVe is count based. Ultimately, both approaches tend to 
produce vector-space embeddings that perform similarly in downstream NLP applications, 
with some research suggesting that word2vec may provide modestly better results in 
select cases. One potential advantage of GloVe is that it was designed to be parallelized 
over multiple processors or even multiple machines, so it might be a good option if 
you're looking to create a word-vector space with many unique words and a very large 
corpus. 


14. Regardless of whether you use the SG or CBOW architecture, an additional option you have while running 
word2vec is the training method. For this, you have two different options: hierarchical softmax and negative sampling. 
The former involves normalization and is better suited to rare words. The latter, on the other hand, forgoes nor- 
malization, making it better suited to common words and low-dimensional word-vector spaces. For our purposes 
in this book, the differences between these two training methods are insignificant and we don’t cover them further. 

15. Pennington, J., et al. (2014). GloVe: Global vectors for word representations. Proceedings of the Conference on 
Empirical Methods in Natural Language Processing. 

The contemporary leading alternative to both word2vec and GloVe is 
fastText. 16, 17, 18, 19 This approach was developed by researchers at Facebook. A 
major benefit of fastText is that it operates on a subword level: its “word” vectors are 
actually subcomponents of words. This enables fastText to work around some of the 
issues related to rare words and out-of-vocabulary words addressed in the preprocessing 
section at the outset of this chapter. 


Evaluating Word Vectors 

However you create your word vectors—be it with word2vec or an alternative 
approach—there are two broad perspectives you can consider when evaluating the quality 
of word vectors: intrinsic and extrinsic evaluations. 

Extrinsic evaluations involve assessing the performance of your word vectors within 
whatever your downstream NLP application of interest is (your sentiment-analysis 
classifier, say, or perhaps your named-entity recognition tool). Although extrinsic evaluations 
can take longer to carry out because they require you to carry out all of your downstream 
processing steps, including perhaps training a computationally intensive deep learning 
model, you can be confident that it's worthwhile to retain a change to your word vectors 
if it relates to an appreciable improvement in the accuracy of your NLP application. 

In contrast, intrinsic evaluations involve assessing the performance of your word 
vectors not on your final NLP application, but rather on some specific intermediate 
subtask. One common such task is assessing whether your word vectors correspond well to 
arithmetical analogies like those shown in Figure 2.7. For example, if you start at the 
word-vector location for king, subtract man, and add woman, do you end up near the 
word-vector location for queen? 20 

Relative to extrinsic evaluations, intrinsic tests are quick. They may also help you 
better understand (and therefore troubleshoot) intermediate steps within your broader NLP 
process. The limitation of intrinsic evaluations, however, is that they may not ultimately 
lead to improvements in the accuracy of your NLP application downstream unless you've 
identified a reliable, quantifiable relationship between performance on the intermediate 
test and your NLP application. 
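
A single analogy test of this kind can be sketched with toy vectors. The two-dimensional values in toy below are invented so that the arithmetic works out neatly (think of the first dimension as "royalty" and the second as "gender"); learned vectors are never this tidy, and nearest-neighbor search over a real vocabulary is usually delegated to a library.

```python
import math

# Hypothetical two-dimensional word vectors, invented for illustration
toy = {
    'king':  [1.0, 1.0],
    'man':   [0.0, 1.0],
    'woman': [0.0, -1.0],
    'queen': [1.0, -1.0],
    'apple': [-1.0, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Start at king, subtract man, add woman
query = [k - m + w for k, m, w in zip(toy['king'], toy['man'], toy['woman'])]

# Exclude the three input words, then find the most similar remaining word
candidates = {w: v for w, v in toy.items() if w not in ('king', 'man', 'woman')}
nearest = max(candidates, key=lambda w: cosine(query, candidates[w]))

print(nearest)
```

An intrinsic score is then simply the fraction of such analogies for which the nearest word is the expected one.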

Running word2vec 

As mentioned earlier, and as shown in Example 11.8, word2vec can be run in a single line 
of code—albeit with quite a few arguments. 


16. The open-source fastText library is available at fasttext.cc. 

17. Joulin, A., et al. (2016). Bag of tricks for efficient text classification. arXiv: 1607.01759. 

18. Bojanowski, P., et al. (2016). Enriching word vectors with subword information. arXiv: 1607.04606. 

19. Note that the lead author of the landmark word2vec paper, Tomas Mikolov, is the final author of both of these 
landmark fastText papers. 

20. A test set of 19,500 such analogies was developed by Tomas Mikolov and his colleagues in their 2013 word2vec 
paper. This test set is available at download.tensorflow.org/data/questions-words.txt. 

Example 11.8 Running word2vec 

model = Word2Vec(sentences=clean_sents, size=64, 
                 sg=1, window=10, iter=5, 
                 min_count=10, workers=4) 

Here's a breakdown of each of the arguments we passed into the Word2Vec() method 
from the gensim library: 

■ sentences: Pass in a list of lists like clean_sents as a corpus. Elements in the 
higher-level list are sentences, whereas elements in the lower-level list can be 
word-level tokens. 

■ size: The number of dimensions in the word-vector space that will result from 
running word2vec. This is a hyperparameter that can be varied and evaluated 
extrinsically or intrinsically. Like other hyperparameters in this book, there is a 
Goldilocks sweet spot. You can home in on an optimal value by specifying, say, 
32 dimensions and varying this value by powers of 2. Doubling the number of 
dimensions will double the computational complexity of your downstream deep 
learning model, but if doing this results in markedly higher model accuracy then 
this extrinsic evaluation suggests that the extra complexity could be worthwhile. 
On the other hand, halving the number of dimensions halves computational 
complexity downstream: If this can be done without appreciably decreasing your NLP 
model's accuracy, then it should be. By performing a handful of intrinsic 
inspections (which we'll go over shortly), we found 64 dimensions to provide more 
sensible word vectors than 32 dimensions for this particular case. Doubling this 
figure to 128, however, provided no noticeable improvement. 

■ sg: Set to 1 to choose the skip-gram architecture, or leave at the 0 default to 
choose CBOW. As summarized in Table 11.1, SG is generally better suited to 
small datasets like our Gutenberg corpus. 

■ window: For SG, a window size of 10 (for a total of 20 context words) is a good 
bet, so we set this hyperparameter to 10. If we were using CBOW, then a window 
size of 5 (for a total of 10 context words) could be near the optimal value. In either 
case, this hyperparameter can be experimented with and evaluated extrinsically 
or intrinsically. Small adjustments to this hyperparameter may not be perceptibly 
impactful, however. 

■ iter: By default, the gensim Word2Vec() method iterates over the corpus fed into 
it (i.e., slides over all of the words) five times. Running multiple iterations of word2vec is 
analogous to training a deep learning model for multiple epochs. With a small 
corpus like ours, the word vectors improve over several iterations. With a very large 
corpus, on the other hand, it might be cripplingly computationally expensive to run 
even two iterations and, because there are so many examples of words in a very 
large corpus anyway, the word vectors might not be any better. 

■ min_count: This is the minimum number of times a word must occur across the 
corpus in order to fit it into word-vector space. If a given target word occurs only 
once or a few times, there are a limited number of examples of its contextual words 
to consider, and so its location in word-vector space may not be reliable. Because of 
this, a minimum count of about 10 is often reasonable. The higher the count, the 
smaller the vocabulary of words that will be available to your downstream NLP task. 
This is yet another hyperparameter that can be tuned, with extrinsic evaluations 
likely being more illuminating than intrinsic ones because the size of the vocabulary 
you have to work with could make a considerable impact on your downstream 
NLP application. 

■ workers: This is the number of processing cores you'd like to dedicate to training. 
If the CPU on your machine has, say, eight cores, then eight is the largest number 
of parallel worker threads you can have. In this case, if you choose to use fewer than 
eight cores, you're leaving compute resources available for other tasks. 
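
A practical note on argument names: Example 11.8 uses the names from the gensim 3.x releases. From gensim 4.0 onward, size is called vector_size and iter is called epochs, while sg, window, min_count, and workers keep their names. The helper below is a hypothetical convenience of our own (word2vec_kwargs is not part of gensim) that picks the right names from a version string such as the one in gensim.__version__.

```python
def word2vec_kwargs(gensim_version, n_dims=64, n_passes=5):
    """Return dimension and iteration arguments named for the given gensim version."""
    major = int(gensim_version.split('.')[0])
    if major >= 4:
        names = {'vector_size': n_dims, 'epochs': n_passes}
    else:
        names = {'size': n_dims, 'iter': n_passes}
    # These argument names are shared by both version families
    names.update({'sg': 1, 'window': 10, 'min_count': 10, 'workers': 4})
    return names

print(word2vec_kwargs('3.8.3'))
print(word2vec_kwargs('4.3.2'))
```

The resulting dictionary could then be unpacked into the call, for example Word2Vec(sentences=clean_sents, **word2vec_kwargs(gensim.__version__)).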

In our GitHub repository, we saved our model using the save () method of word2vec 
objects: 

model.save('clean_gutenberg_model.w2v') 

Instead of running word2vec yourself, then, you're welcome to load up our word vectors 
using this code: 

model = gensim.models.Word2Vec.load('clean_gutenberg_model.w2v') 


If you do choose the word vectors we created, then the following examples 
will produce the same outputs. 21 We can see the size of our vocabulary by calling 
len(model.wv.vocab). This tells us that there are 10,329 words (well, more specifically, 
tokens) that occur at least 10 times within our clean_sents corpus. 22 One of the 
words in our vocabulary is dog. As shown in Figure 11.6, we can output its location in 
64-dimensional word-vector space by running model.wv['dog']. 


array([ 0.38401067,  0.01232518, -0.37594706, -0.00112308, 
        0.38663676,  0.01287549,  0.398965  ,  0.0096426 , 
       -0.10419296, -0.02877572,  0.3207022 ,  0.27838793, 
        0.62772304,  0.34408906,  0.23356602,  0.24557391, 
        0.3398472 ,  0.07168821, -0.18941355, -0.10122284, 
       -0.35172758,  0.4038952 , -0.12179806,  0.096336  , 
        0.00641343,  0.02332107,  0.7743452 ,  0.03591069, 
       -0.20103034, -0.1688079 , -0.01331445, -0.29832968, 
        0.08522387, -0.02750671,  0.32494134, -0.14266558, 
       -0.4192913 , -0.09291836, -0.23813559,  0.38258648, 
        0.11036541,  0.005807  , -0.16745028,  0.34308755, 
       -0.20224966, -0.77683043,  0.05146591, -0.5883941 , 
       -0.0718769 , -0.18120563,  0.00358319, -0.29351747, 
        0.153776  ,  0.48048878,  0.22479494,  0.5465321 , 
        0.29695514,  0.00986911, -0.2450937 , -0.19344331, 
        0.3541134 ,  0.3426432 , -0.10496043,  0.00543602], 
      dtype=float32) 


Figure 11.6 The location of the token “dog” within the 64-dimensional word-vector 
space we generated using a corpus of books from Project Gutenberg 


21. Every time word2vec is run, the initial locations of every word of the vocabulary within word-vector space 
are assigned randomly. Because of this, the same data and arguments provided to Word2Vec() will nevertheless 
produce unique word vectors every time, but the semantic relationships should be similar. 

22. Vocabulary size is equal to the number of tokens from our corpus that had occurred at least 10 times, because 
we set min_count=10 when calling Word2Vec() in Example 11.8. 

As a rudimentary intrinsic evaluation of the quality of our word vectors, we can use 
the most_similar() method to confirm that words with similar meanings are found in 
similar locations within our word-vector space. 23 For example, to output the three words 
that are most similar to father in our word-vector space, we can run this code: 

model.wv.most_similar('father', topn=3) 

This outputs the following: 

[('mother', 0.8257375359535217), 

('brother', 0.7275018692016602), 

('sister', 0.7177823781967163)] 

This output indicates that mother, brother, and sister are the most similar words to father in 
our word-vector space. In other words, within our 64-dimensional space, the word that is 
closest 24 to father is the word mother. Table 11.2 provides some additional examples of the 
words most similar to (i.e., closest to) particular words that we’ve picked from our 
word-vector vocabulary, all five of which appear pretty reasonable given our small Gutenberg 
corpus. 25 

Suppose we run the following line of code: 

model.wv.doesnt_match("mother father sister brother dog".split()) 

We get the output dog, indicating that dog is the least similar relative to all the other 
possible word pairs. We can also use the following line to observe that the similarity score 
between father and dog is a mere 0.44: 

model.wv.similarity('father', 'dog') 


Table 11.2 The words most similar to select test words from our Project 
Gutenberg vocabulary 

Test Word     Most Similar Word     Cosine Similarity Score 
father        mother                0.82 
dog           puppy                 0.78 
eat           drink                 0.83 
day           morning               0.76 
ma_am         madam                 0.85 



23. Technically speaking, the similarity between two given words is computed here by calculating the cosine 
similarity. 

24. That is, has the shortest Euclidean distance in that 64-dimensional vector space. 

25. Note that the final test word in Table 11.2, ma’am, is only available because of the bigram collocation (see 
Examples 11.6 and 11.7). 

This similarity score of 0.44 is much lower than the similarity between father and any of 
mother, brother, or sister, and so it’s unsurprising that dog is relatively distant from the other 
four words within our word-vector space. 
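
The cosine similarity scores reported above can be reproduced by hand. The following is a minimal sketch in plain Python; the three-dimensional vectors are made-up stand-ins for our 64-dimensional word vectors, and cosine_similarity() is a hypothetical helper rather than a gensim function.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors (assumption: real word vectors would have 64 dimensions).
father = [0.9, 0.8, 0.1]
mother = [0.8, 0.9, 0.2]
dog = [0.1, 0.2, 0.9]

sim_father_mother = cosine_similarity(father, mother)
sim_father_dog = cosine_similarity(father, dog)
print(round(sim_father_mother, 2), round(sim_father_dog, 2))
```

Vectors pointing in nearly the same direction score close to 1, whereas vectors pointing in unrelated directions score much lower, which is the pattern we saw for father versus mother and father versus dog.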

As a final little intrinsic test, we can compute word-vector analogies as in Figure 2.7. 
For example, to calculate $v_{\text{father}} - v_{\text{man}} + v_{\text{woman}}$, we can execute this code: 

model.wv.most_similar(positive=['father', 'woman'], negative=['man']) 

The top-scoring word comes out as mother, which is the correct answer to the analogy. 
Suppose we likewise execute this code: 

model.wv.most_similar(positive=['husband', 'woman'], negative=['man']) 

In this case, the top-scoring word comes out as wife, again the correct answer, thereby 
suggesting that our word-vector space may generally be on the right track. 
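
The arithmetic behind such analogies can be sketched in plain Python. The two-dimensional vectors below are invented so that one axis loosely encodes gender and the other parenthood; the analogy() helper is hypothetical and simplified relative to what gensim does internally (for instance, it skips vector normalization before combining).

```python
import math

# Invented 2-D vectors: (gender axis, parenthood axis). Purely illustrative.
vectors = {
    "man":    [1.0, 0.0],
    "woman":  [-1.0, 0.0],
    "father": [1.0, 1.0],
    "mother": [-1.0, 1.0],
    "dog":    [0.0, -1.0],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, return the nearest other word."""
    n_dims = len(next(iter(vectors.values())))
    target = [0.0] * n_dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in positive + negative]
    return max(candidates, key=lambda w: cosine_similarity(target, vectors[w]))

answer = analogy(positive=["father", "woman"], negative=["man"])
print(answer)
```

In this toy space, subtracting man from father and adding woman lands exactly on the vector for mother; in a real word-vector space the result is only approximately equal to the nearest word.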


A given dimension within an n-dimensional word-vector space does not necessarily 
represent any specific factor that relates words. For example, although the real-world 
differences in meaning of gender or verb tense are represented by some vector direction 
(i.e., some movement along some combination of dimensions) within the vector 
space, this meaningful vector direction may only by chance be aligned—or perhaps 
correlated—with a particular axis of the vector space. 

This contrasts with some other approaches that involve n-dimensional vector 
spaces, where the axes are intended to represent some specific explanatory variable. 
One such approach that many people are familiar with is principal component anal- 
ysis (PCA), a technique for identifying linearly uncorrelated (i.e., orthogonal) vectors 
that contribute to variance in a given dataset. A corollary of this difference between 
information stored as points in PCA versus in word-vector space is that in PCA, the 
first principal components contribute most of the variance, and so you can focus on 
them and ignore later principal components; but in a word-vector space, all of the 
dimensions may be important and need to be taken into consideration. In this way, 
approaches like PCA are useful for dimensionality reduction because we do not need 
to consider all of the dimensions. 


Plotting Word Vectors 

Human brains are not well suited to visualizing anything in greater than three dimensions. 
Thus, plotting word vectors (which could have dozens or even hundreds of 
dimensions) in their native format is out of the question. Thankfully, we can use 
techniques for dimensionality reduction to approximately map the locations of words from 
high-dimensional word-vector space down to two or three dimensions. Our recommended 
approach for such dimensionality reduction is t-distributed stochastic neighbor embedding 
(t-SNE; pronounced tee-snee), which was developed by Laurens van der Maaten in 
collaboration with Geoff Hinton (Figure 1.16). 26 

Example 11.9 provides the code from our Natural Language Preprocessing notebook 
for reducing our 64-dimensional Project Gutenberg-derived word-vector space down 
to two dimensions, and then storing the resulting x and y coordinates within a Pandas 
DataFrame. There are two arguments for the TSNE() method (from the scikit-learn library) 
that we need to focus on: 

■ n_components is the number of dimensions that should be returned, so setting 
this to 2 results in a two-dimensional output, whereas 3 would result in a 
three-dimensional output. 

■ n_iter is the number of iterations over the input data. As with word2vec 
(Example 11.8), iterations are analogous to the epochs associated with training a 
neural network. More iterations correspond to a longer training time but may 
improve the results (although only up to a point). 

Example 11.9 t-SNE for dimensionality reduction 

tsne = TSNE(n_components=2, n_iter=1000) 
X_2d = tsne.fit_transform(model.wv[model.wv.vocab]) 
coords_df = pd.DataFrame(X_2d, columns=['x', 'y']) 
coords_df['token'] = model.wv.vocab.keys() 

Running t-SNE as in Example 11.9 may take some time on your machine, so you’re 
welcome to use our results if you’re feeling impatient by running the following code: 27, 28 

coords_df = pd.read_csv('clean_gutenberg_tsne.csv') 

Whether you ran t-SNE to produce coords_df on your own or you loaded in ours, you 
can check out the first few lines of the DataFrame by using the head() method: 

coords_df.head() 

Our output from executing head() is shown in Figure 11.7. 

Example 11.10 provides code for creating a static scatterplot (Figure 11.8) of the 
two-dimensional data we created with t-SNE (in Example 11.9). 

Example 11.10 Static two-dimensional scatterplot of word-vector space 

_ = coords_df.plot.scatter('x', 'y', figsize=(12,12), 
                           marker='.', s=10, alpha=0.2) 


26. van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 
9, 2579-605. 

27. We created this CSV after running t-SNE on our word-vectors using this command: 
coords_df.to_csv('clean_gutenberg_tsne.csv', index=False) 

28. Note that because t-SNE is stochastic, you will obtain a unique result every time you run it. 

     x            y            token 
0    62.494060    8.023034     emma 
1    8.142986     33.342200    by 
2    62.507140    10.078477    jane 
3    12.477635    17.998343    volume 
4    25.736960    30.876250    i 

Figure 11.7 This is a Pandas DataFrame containing a two-dimensional representation 
of the word-vector space we created from the Project Gutenberg corpus. Each unique 
token has an x and y coordinate. 

Figure 11.8 Static two-dimensional word-vector scatterplot 

On its own, the scatterplot displayed in Figure 11.8 may look interesting, but there’s 
little actionable information we can take away from it. Instead, we recommend using 
the bokeh library to create a highly interactive (and actionable) plot, as with the code 
provided in Example 11.11. 29 

Example 11.11 Interactive bokeh plot of two-dimensional word-vector data 

output_notebook() 

subset_df = coords_df.sample(n=5000) 
p = figure(plot_width=800, plot_height=800) 

_ = p.text(x=subset_df.x, y=subset_df.y, text=subset_df.token) 
show(p) 

The code in Example 11.11 produces the interactive scatterplot in Figure 11.9 using the x 
and y coordinates generated using t-SNE. 



29. In Example 11.11, we used the Pandas sample() method to reduce the dataset down to 5,000 tokens, because 
we found that using more data than this corresponded to a clunky user experience when using the bokeh plot 
interactively. 

Figure 11.10 Clothing words from the Project Gutenberg corpus, revealed by zooming 
in to a region of the broader bokeh plot from Figure 11.9 


By toggling the Wheel Zoom button in the top-right corner of the plot, you can use 
your mouse to zoom into locations within the cloud so that the words become legible. 

For example, as shown in Figure 11.10, we identified a region composed largely of items 
of clothing, with related clusters nearby, including parts of the human anatomy, colors, 
and fabric types. Exploring in this way provides a largely subjective intrinsic evaluation of 
whether related terms (and particularly synonyms) cluster together as you’d expect them 
to. In doing so, you may also notice particular shortcomings of your natural-language 
preprocessing steps, such as the inclusion of punctuation marks, bigrams, or other tokens 
that you may prefer weren’t included within your word-vector vocabulary. 

The Area under the ROC Curve 


Our apologies for interrupting the fun, interactive plotting of word vectors. We need 
to take a brief break from natural language-specific content here to introduce a metric 
that will come in handy in the next section of the chapter, when we will evaluate the 
performance of deep learning NLP models. 

Up to this point in the book, most of our models have involved multiclass outputs: 
When working with the MNIST digits, for example, we used 10 output neurons to rep- 
resent each of the 10 possible digits that an input image could represent. In the remaining 
sections of this chapter, however, our deep learning models will be binary classifiers: They 
will distinguish between only two classes. More specifically, we will build binary classifiers 
to predict whether the natural language of film reviews corresponds to a favorable review 
or negative one. 

Unlike artificial neural networks tasked with multiclass problems, which require as 
many output neurons as classes, ANNs that are acting as binary classifiers require only a 
single output neuron. This is because there is no extra information associated with having 
two output neurons. If a binary classifier is provided some input x and it calculates some 
output y for one of the classes, then the output for the other class is simply 1 − y. As an 
example, if we feed a movie review into a binary classifier and it outputs that the 
probability that this review is a positive one is 0.85, then it must be the case that the probability 
of the review being negative is 1 − 0.85 = 0.15. 

Because binary classifiers have a single output, we can take advantage of metrics for 
evaluating our model’s performance that are sophisticated relative to the excessively 
black-and-white accuracy metric that dominates multiclass problems. A typical accuracy 
calculation, for example, would contend that if y > 0.5, then the model is predicting that 
the input x belongs to one class, whereas if it outputs anything less than 0.5, it belongs 
to the other class. To illustrate why having a specific binary threshold like this is overly 
simplistic, consider a situation where inputting a movie review results in a binary classifier 
outputting y = 0.48: A typical accuracy calculation threshold would hold that—because 
this y is lower than 0.5—it is being classed as a negative review. If a second film review 
corresponds to an output of y = 0.51, the model has barely any more confidence that 
this review is positive relative to the first review. Yet, because 0.51 is greater than the 0.5 
accuracy threshold, the second review is classed as a positive review. 
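
The two reviews just described can be run through a fixed 0.5 threshold in a few lines of plain Python. This is only a sketch: classify() and the review names are hypothetical, and the outputs 0.48 and 0.51 are the ones from the example above.

```python
def classify(y_hat, threshold=0.5):
    """Class 1 (positive) if the model output exceeds the threshold, else class 0."""
    return 1 if y_hat > threshold else 0

# Model outputs for the two film reviews discussed above.
reviews = {"first review": 0.48, "second review": 0.51}
labels = {name: classify(y_hat) for name, y_hat in reviews.items()}
print(labels)
```

Although the two outputs differ by only 0.03, they land on opposite sides of the threshold and so receive opposite labels, which is the nuance that a single-threshold accuracy calculation discards.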

The starkness of the accuracy metric threshold can hide a fair bit of nuance in the 
quality of our model’s output, and so when evaluating the performance of binary 
classifiers, we prefer a metric called the area under the curve of the receiver operating characteristic. 
The ROC AUC, as the metric is known for short, has its roots in the Second World War, 
when it was developed to assess the performance of radar engineers’ judgment as they 
attempted to identify the presence of enemy objects. 

We like the ROC AUC for two reasons: 

1. It blends together two useful metrics— true positive rate and false positive rate —into a 
single summary value. 

2. It enables us to evaluate the performance of our binary classifier’s output across the 
full range of y, from 0.0 to 1.0. This contrasts with the accuracy metric, which 
evaluates the performance of a binary classifier at a single threshold value only 
(usually y = 0.5). 

The Confusion Matrix 

The first step toward understanding how to calculate the ROC AUC metric is to understand 
the so-called confusion matrix, which (as you’ll see) isn’t actually all that confusing. 
Rather, the matrix is a straightforward 2×2 table of how confused a model (or, as back in 
WWII, a person) is while attempting to act as a binary classifier. You can see an example 
of a confusion matrix in Table 11.3. 

Table 11.3 A confusion matrix 

                     actual y = 1       actual y = 0 
predicted y = 1      True positive      False positive 
predicted y = 0      False negative     True negative 

To bring the confusion matrix to life with an example, let’s return to the hot dog / 
not hot dog binary classifier that we’ve used to construct silly examples over many of the 
preceding chapters: 

■ When we provide some input x to a model and it predicts that the input represents 
a hot dog, then we’re dealing with the first row of the table, because the predicted 
y = 1. In that case, 

■ True positive: If the input is actually a hot dog (i.e., actual y = 1), then the 
model correctly classified the input. 

■ False positive: If the input is actually not a hot dog (i.e., actual y = 0), then 
the model is confused. 

■ When we provide some input x to a model and it predicts that the input does not 
represent a hot dog, then we’re dealing with the second row of the table, because 
predicted y = 0. In that case, 

■ False negative: If the input is actually a hot dog (i.e., actual y = 1), then the 
model is also confused in this circumstance. 

■ True negative: If the input is actually not a hot dog (i.e., actual y = 0), then 
the model correctly classified the input. 
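
The four cells of the confusion matrix map directly onto a small lookup. The sketch below is plain Python; confusion_cell() and the five example labels are hypothetical, chosen only to exercise every cell of Table 11.3.

```python
from collections import Counter

def confusion_cell(actual, predicted):
    """Name the confusion-matrix cell for one (actual, predicted) pair of 0/1 labels."""
    if predicted == 1:
        return "TP" if actual == 1 else "FP"  # first row: predicted hot dog
    return "FN" if actual == 1 else "TN"      # second row: predicted not hot dog

# Hypothetical labels: 1 = hot dog, 0 = not hot dog.
actual =    [1, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1]

counts = Counter(confusion_cell(a, p) for a, p in zip(actual, predicted))
print(dict(counts))
```

Tallying the cells in this way is all that is needed to fill in a confusion matrix for any set of binary predictions.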

Calculating the ROC AUC Metric 

Briefed on the confusion matrix, we can now move forward and calculate the ROC AUC 
metric itself, using a toy-sized example. Let’s say, as shown in Table 11.4, we provide four 
inputs to a binary-classification model. Two of these inputs are actually hot dogs (y = 1), 
and two of them are not hot dogs (y = 0). For each of these inputs, the model outputs 
some predicted ŷ, all four of which are provided in Table 11.4. 

Table 11.4 Four hot dog / not hot dog predictions 

y     ŷ 
0     0.3 
1     0.5 
0     0.6 
1     0.9 

Table 11.5 Four hot dog / not hot dog predictions, now with intermediate ROC 
AUC calculations 

y                                   ŷ      0.3 threshold      0.5 threshold      0.6 threshold 
0 (not hot dog)                     0.3    0 (TN)             0 (TN)             0 (TN) 
1 (hot dog)                         0.5    1 (TP)             0 (FN)             0 (FN) 
0 (not hot dog)                     0.6    1 (FP)             1 (FP)             0 (TN) 
1 (hot dog)                         0.9    1 (TP)             1 (TP)             1 (TP) 
True Positive Rate = TP/(TP+FN)            2/(2+0) = 1.0      1/(1+1) = 0.5      1/(1+1) = 0.5 
False Positive Rate = FP/(FP+TN)           1/(1+1) = 0.5      1/(1+1) = 0.5      0/(0+2) = 0.0 

To calculate the ROC AUC metric, we consider each of the ŷ values output by the 
model as the binary-classification threshold in turn. Let’s start with the lowest ŷ, which is 
0.3 (see the “0.3 threshold” column in Table 11.5). At this threshold, only the first input 
is classed as not a hot dog, whereas the second through fourth inputs (all with ŷ > 0.3) 
are all classed as hot dogs. We can compare each of these four predicted classifications 
with the confusion matrix in Table 11.3: 

1. True negative (TN): This is actually not a hot dog (y = 0) and was correctly 
predicted as such. 

2. True positive (TP): This is actually a hot dog (y = 1) and was correctly predicted 
as such. 

3. False positive (FP): This is actually not a hot dog (y = 0) but it was erroneously 
predicted to be one. 

4. True positive (TP): Like input 2, this is actually a hot dog (y = 1) and was correctly 
predicted as such. 

The same process is repeated with the classification threshold set to 0.5 and yet again with 
the threshold set to 0.6, allowing us to populate the remaining columns of Table 11.5. As 
an exercise, it might be wise to work through these two columns, comparing the 
classifications at each threshold with the actual y values and the confusion matrix (Table 11.3) 
to ensure that you have a good handle on these concepts. Finally, note that the highest 
y value (in this case, 0.9) can be skipped as a potential threshold, because at such a high 
threshold we’d be considering all four instances to not be hot dogs, making it a ceiling 
instead of a classification boundary. 

The next step toward computing the ROC AUC metric is to calculate both the true 
positive rate (TPR) and the false positive rate (FPR) at each of the three thresholds. 
Equations 11.1 and 11.2 use the “0.3 threshold” column to provide examples of how to 
calculate the true positive rate and false positive rate, respectively. 

\[
\text{True Positive Rate} = \frac{(\text{TP count})}{(\text{TP count}) + (\text{FN count})} = \frac{2}{2 + 0} = \frac{2}{2} = 1.0 \tag{11.1}
\]

\[
\text{False Positive Rate} = \frac{(\text{FP count})}{(\text{FP count}) + (\text{TN count})} = \frac{1}{1 + 1} = \frac{1}{2} = 0.5 \tag{11.2}
\]

Shorthand versions of the arithmetic for calculating TPR and FPR for the thresholds 
0.5 and 0.6 are also provided for your convenience at the bottom of Table 11.5. Again, 
perhaps you should test if you can compute these values yourself on your own time. 
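
If you would like to check your hand calculations, the following plain-Python sketch recomputes the TPR and FPR at each of the three thresholds from the four predictions in Table 11.4. The rates() helper is a hypothetical name, and it assumes (as in the text) that an input is classed as a hot dog only when its predicted value is strictly greater than the threshold.

```python
# Actual labels and model outputs from Table 11.4.
y = [0, 1, 0, 1]
y_hat = [0.3, 0.5, 0.6, 0.9]

def rates(y, y_hat, threshold):
    """Return (true positive rate, false positive rate) at the given threshold."""
    predicted = [1 if p > threshold else 0 for p in y_hat]
    tp = sum(1 for a, p in zip(y, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(y, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(y, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(y, predicted) if a == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

table = {t: rates(y, y_hat, t) for t in (0.3, 0.5, 0.6)}
for threshold, (tpr, fpr) in table.items():
    print(threshold, tpr, fpr)
```

The three (TPR, FPR) pairs should agree with the bottom two rows of Table 11.5.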

The final stage in calculating ROC AUC is to create a plot like the one we provide in 
Figure 11.11. The points that make up the shape of the receiver operating characteristic 
(ROC) curve are the false positive rate (horizontal, x-axis coordinate) and true positive 
rate (vertical, y-axis coordinate) at each of the available thresholds (which in this case is 
three) in Table 11.5, plus two extra points in the bottom-left and top-right corners of the 
plot. Specifically, these five points (shown as orange dots in Figure 11.11) are: 

1. (0, 0) for the bottom-left corner 

2. (0, 0.5) from the 0.6 threshold 

3. (0.5, 0.5) from the 0.5 threshold 

4. (0.5, 1) from the 0.3 threshold 

5. (1, 1) for the top-right corner 

In this toy-sized example, we only used four distinct y values, so there are only five 
points that determine the shape of the ROC curve, making the curve rather step shaped. 
When there are many available predictions providing many distinct y values (as is 
typically the case in real-world examples), the ROC curve has many more points, and so 
it’s much less step shaped and much more, well, curve shaped. The area under the curve 
(AUC) of the ROC curve is exactly what it sounds like: In Figure 11.11, we’ve shaded 
this area in orange and, in this example, the AUC constitutes 75 percent of all the possible 
area and so the ROC AUC metric comes out to 0.75. 
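
The 0.75 figure can be verified numerically. The sketch below, in plain Python, applies the trapezoidal rule to the five ROC points listed above; area_under_curve() is a hypothetical helper, and in practice a library routine such as scikit-learn's roc_auc_score would normally be used instead.

```python
# The five (false positive rate, true positive rate) points of the toy ROC curve.
roc_points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

def area_under_curve(points):
    """Trapezoidal-rule area under a curve given as (x, y) points."""
    points = sorted(points)  # order by x, then by y for ties
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

auc = area_under_curve(roc_points)
print(auc)
```

The vertical steps contribute no area (their width is zero), so the total comes from the two horizontal segments: 0.5 × 0.5 plus 0.5 × 1.0.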


[Figure 11.11: ROC curve, with false positive rate on the x-axis and true positive rate on the y-axis, each running from 0 to 1] 

Figure 11.11 The (orange-shaded) area under the curve of the receiver operating 
characteristic, determined using the TPRs and FPRs from Table 11.5 

A binary classifier that works as well as chance will generate a straight diagonal running 
from the bottom-left corner of the plot to its top-right corner, so an ROC AUC of 0.5 
indicates that the classifier works as well as flipping a coin. A perfect ROC AUC is 1.0, 
which is attained by having FPR = 0 and TPR = 1 across all of the available y thresholds. 
When you’re designing a binary classifier to perform well on the ROC AUC metric, 
the goal is thus to minimize FPR and maximize TPR across the range of y thresholds. 
That said, for most problems you encounter, attaining a perfect ROC AUC of 1.0 is not 
possible: There is usually some noise (perhaps a lot of noise) in the data that makes 
perfection unattainable. Thus, when you’re working with any given dataset, there is some 
(typically unknown!) maximum ROC AUC score, such that no matter how ideally suited 
your model is to act as a binary classifier for the problem, there’s an ROC AUC ceiling 
that no model can crack through. 

Over the remainder of this chapter we use the illuminating ROC AUC metric, along- 
side the simpler accuracy and cost metrics you are already acquainted with, to evaluate the 
performance of the binary-classifying deep learning models that we design and train. 

Natural Language Classification with Familiar Networks 


In this section, we tie together concepts that were introduced in this chapter (natural 
language preprocessing best practices, the creation of word vectors, and the ROC AUC 
metric) with the deep learning theory from previous chapters. As we already alluded to 
earlier, the natural language processing model you’ll experiment with over the remainder 
of the chapter will be a binary classifier that predicts whether a given film review is a 
positive one or a negative one. We begin by classifying natural language documents using 
types of neural networks that you’re already familiar with (dense and convolutional) 
before moving along to networks that are specialized to handle data that occur in a 
sequence. 

Loading the IMDb Film Reviews 

As a performance baseline, we’ll initially train and test a relatively simple dense network. 
All of the code for doing this is provided within our Dense Sentiment Classifier Jupyter 
notebook. 

Example 11.12 provides the dependencies we need for our dense sentiment classifier. 
Many of these dependencies will be recognizable from previous chapters, but others (e.g., 
for loading a dataset of film reviews, saving model parameters as we train, calculating 
ROC AUC) are new. As usual, we cover the details of these dependencies as we apply 
them later on. 

Example 11.12 Loading sentiment classifier dependencies 

import keras 
from keras.datasets import imdb # new! 
from keras.preprocessing.sequence import pad_sequences # new! 
from keras.models import Sequential 
from keras.layers import Dense, Flatten, Dropout 
from keras.layers import Embedding # new! 
from keras.callbacks import ModelCheckpoint # new! 
import os # new! 
from sklearn.metrics import roc_auc_score, roc_curve # new! 
import pandas as pd 
import matplotlib.pyplot as plt # new! 
%matplotlib inline 


It’s a good programming practice to put as many hyperparameters as you can at the 
top of your file. This makes it easier to experiment with these hyperparameters. It also 
makes it easier for you (or, indeed, your colleagues) to understand what you were doing 
in the file when you return to it (perhaps much) later. With this in mind, we place all of 
our hyperparameters together in a single cell within our Jupyter notebook. The code is 
provided in Example 11.13. 

Example 11.13 Setting dense sentiment classifier hyperparameters 

# output directory name: 
output_dir = 'model_output/dense' 

# training: 
epochs = 4 
batch_size = 128 

# vector-space embedding: 
n_dim = 64 
n_unique_words = 5000 
n_words_to_skip = 50 
max_review_length = 100 
pad_type = trunc_type = 'pre' 

# neural network architecture: 
n_dense = 64 
dropout = 0.5 


Let’s break down the purpose of each of these variables: 

■ output_dir: A directory name (ideally, a unique one) in which to store our 
model’s parameters after each epoch, allowing us to return to the parameters from 
any epoch of our choice at a later time. 

■ epochs: The number of epochs that we’d like to train for, noting that NLP models 
often overfit to the training data in fewer epochs than machine vision models. 

■ batch_size: As before, the number of training examples used during each round of 
model training (see Figure 8.5). 

■ n_dim: The number of dimensions we’d like our word-vector space to have. 

■ n_unique_words: With word2vec earlier in this chapter, we included tokens in our 
word-vector vocabulary only if they occurred at least a certain number of times 
within our corpus. An alternative approach (the one we take here) is to sort all 
of the tokens in our corpus by the number of times they occur, and then only use 
a certain number of the most popular words. Andrew Maas and his coworkers 30 
opted to use the 5,000 most popular words across their film-review corpus and so 
we’ll do the same. 31 

■ n_words_to_skip: Instead of removing a manually curated list of stop words from 
their word-vector vocabulary, Maas et al. made the assumption that the 50 most 
frequently occurring words across their film-review corpus would serve as a decent 
list of stop words. We followed their lead and did the same. 32 

■ max_review_length: Each movie review must have the same length so that 
TensorFlow knows the shape of the input data that will be flowing through our deep 
learning model. For this model, we selected a review length of 100 words. 33 Any 
reviews longer than 100 are truncated. Any reviews shorter than 100 are padded 
with a special padding character (analogous to the zero padding that can be used in 
machine vision, as in Figure 10.3). 

■ pad_type: By selecting 'pre', we add padding characters to the start of every 
review. The alternative is 'post', which adds them to the end. With a dense 
network like the one in this notebook, it shouldn’t make much difference which of 
these options we pick. Later in this chapter, when we’re working with specialized, 
sequential-data layer types, 34 it’s generally best to use 'pre' because the content 
at the end of the document is more influential in the model and so we want the 
largely uninformative padding characters to be at the beginning of the document. 

■ trunc_type: As with pad_type, our truncation options are 'pre' or 'post'. The 
former will remove words from the beginning of the review, whereas the latter will 
remove them from the end. By selecting 'pre', we’re making a (bold!) assumption 
that the end of film reviews tends to include more information on review sentiment 
than the beginning. 

■ n_dense: The number of neurons to include in the dense layer of our neural 
network architecture. We waved our finger in the air to select 64, so some 
experimentation and optimization are warranted at your end if you feel like it. For 
simplicity’s sake, we also are using a single layer of dense neurons, but you could opt 
to have several. 

■ dropout: How much dropout to apply to the neurons in the dense layer. Again, we 
did not take the time to optimize this hyperparameter (set at 0.5) ourselves. 

30. We mentioned Maas et al. (2011) earlier in this chapter. They put together the movie-review corpus we’re 
using in this notebook. 

31. This 5,000-word threshold may not be optimal, but we didn’t take the time to test lower or higher values. You 
are most welcome to do so yourself! 

32. Note again that following Maas et al.’s lead may not be the optimal choice. Further, note that this means we’ll 
actually be including the 51st most popular word through to the 5050th most popular word in our word-vector 
vocabulary. 

33. You are free to experiment with lengthier or shorter reviews. 

34. For example, RNN, LSTM. 
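
To make the pad_type and trunc_type options concrete, here is a minimal plain-Python sketch of padding and truncating a single integer-encoded review. The pad_or_truncate() function is a hypothetical illustration of the behavior described above, not the Keras pad_sequences() implementation, and the short example reviews are made up.

```python
def pad_or_truncate(review, max_length, pad_type="pre", trunc_type="pre", pad_value=0):
    """Force a list of token indices to exactly max_length entries."""
    review = list(review)
    if len(review) > max_length:
        if trunc_type == "pre":
            review = review[-max_length:]  # drop tokens from the beginning
        else:
            review = review[:max_length]   # drop tokens from the end
    padding = [pad_value] * (max_length - len(review))
    return padding + review if pad_type == "pre" else review + padding

short_review = [5, 6, 7]
long_review = [1, 2, 3, 4, 5, 6]

print(pad_or_truncate(short_review, 5))                    # padded at the start
print(pad_or_truncate(short_review, 5, pad_type="post"))   # padded at the end
print(pad_or_truncate(long_review, 4))                     # keeps the last four tokens
print(pad_or_truncate(long_review, 4, trunc_type="post"))  # keeps the first four tokens
```

With 'pre' for both options, the end of each review is preserved and any padding sits at the start, which matches the choice made in Example 11.13.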

Loading in the film review data is a one-liner, provided in Example 11.14. 

Example 11.14 Loading IMDb film review data 

(x_train, y_train), (x_valid, y_valid) = \ 
    imdb.load_data(num_words=n_unique_words, skip_top=n_words_to_skip) 

This dataset from Maas et al. (2011) is made up of the natural language of reviews from 
the publicly available Internet Movie Database (IMDb; imdb.com). It consists of 50,000 
reviews, half of which are in the training dataset (x_train), and half of which are for 
model validation (x_valid). When submitting their review of a given film, users also 
provide a star rating, with a maximum of 10 stars. The labels (y_train and y_valid) are 
binary, based on these star ratings: 

■ Reviews with a score of four stars or fewer are considered to be a negative review 
(y = 0). 

■ Reviews with a score of seven stars or more, meanwhile, are classed as a positive 
review (y = 1). 

■ Moderate reviews (those with five or six stars) are not included in the dataset, 
making the binary classification task easier for any model. 
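
The labeling rule in the bullets above can be written as a short function. This is a sketch in plain Python; sentiment_label() and the list of ratings are hypothetical, since the dataset ships with the binary labels already assigned.

```python
def sentiment_label(stars):
    """Map a 1-to-10 star rating to a binary label, or None for moderate reviews."""
    if stars <= 4:
        return 0      # negative review
    if stars >= 7:
        return 1      # positive review
    return None       # five or six stars: excluded from the dataset

ratings = [1, 4, 5, 6, 7, 10]
labeled = [(stars, sentiment_label(stars)) for stars in ratings
           if sentiment_label(stars) is not None]
print(labeled)
```

Dropping the five- and six-star reviews leaves a gap between the two classes, which is why the task is easier than it would be if every review had to be labeled.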

By specifying values for the num_words and skip_top arguments when calling 
imdb.load_data(), we are limiting the size of our word-vector vocabulary and removing 
the most common (stop) words, respectively. 


In our Dense Sentiment Classifier notebook, we have the convenience of loading our 
IMDb film-review data via the Keras imdb.load_data() method. When you’re working 
with your own natural language data, you’ll likely need to preprocess many aspects 
of the data yourself. In addition to the general preprocessing guidance we provided 
earlier in this chapter, Keras provides a number of convenient text preprocessing 
utilities, as documented online at keras.io/preprocessing/text. In particular, the 
Tokenizer() class may enable you to carry out all of the preprocessing steps you need 
in a single line of code, including 

■ Tokenizing a corpus to the word level (or even the character level) 

■ Setting the size of your word-vector vocabulary (with num_words) 

■ Filtering out punctuation 

■ Converting all characters to lowercase 

■ Converting tokens into an integer index 
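
The steps listed above can be sketched in plain Python to show what such a utility has to do. This is not the Keras Tokenizer() implementation; the function build_index and the two-document corpus are hypothetical, and details such as how ties or the vocabulary cutoff are handled may differ in Keras.

```python
import string
from collections import Counter

def build_index(corpus, num_words=None):
    """Lowercase, strip punctuation, tokenize, and index the most common words."""
    strip_punct = str.maketrans("", "", string.punctuation)
    tokenized = [doc.lower().translate(strip_punct).split() for doc in corpus]
    counts = Counter(tok for doc in tokenized for tok in doc)
    kept = [word for word, _ in counts.most_common(num_words)]
    word_to_id = {word: i + 1 for i, word in enumerate(kept)}  # 0 left free for padding
    return tokenized, word_to_id

corpus = ["Great film, great cast!", "A great film."]
tokenized, word_to_id = build_index(corpus, num_words=2)
# Words outside the vocabulary are simply dropped in this sketch.
encoded = [[word_to_id[t] for t in doc if t in word_to_id] for doc in tokenized]
print(word_to_id, encoded)
```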
Examining the IMDb Data 

Executing x_train[0:6], we can examine the first six reviews from the training dataset, 
the first two of which are shown in Figure 11.12. These reviews are natively in an 
integer-index format, where each unique token from the dataset is represented by an 
integer. The first few integers are special cases, following a general convention that is 
widely used in NLP: 

■ 0: Reserved as the padding token (which we’ll soon add to the reviews that are 
shorter than max_review_length). 

■ 1: Would be the starting token, which would indicate the beginning of a review. As 
per the next bullet point, however, the starting token is among the top 50 most 
common tokens and so is shown as “unknown.” 

■ 2: Any tokens that occur very frequently across the corpus (i.e., they’re in the top 
50 most common words) or rarely (i.e., they’re below the top 5,050 most common 
words) will be outside of our word-vector vocabulary and so are replaced with this 
unknown token. 

■ 3: The most frequently occurring word in the corpus. 

■ 4: The second-most frequently occurring word. 

■ 5: The third-most frequently occurring, and so on. 

Using the following code from Example 11.15, we can see the length of the first six 
reviews in the training dataset. 

Example 11.15 Printing the number of tokens in six reviews 

for x in x_train[0:6]:
    print(len(x))

They are rather variable, ranging from 43 tokens up to 550 tokens. Shortly, we’ll handle 
these discrepancies, standardizing all reviews to the same length. 

The film reviews are fed into our neural network model in the integer-index format 
of Figure 11.12 because this is a memory-efficient way to store the token information. 

It would require appreciably more memory to feed the tokens in as character strings, for 
example. For us humans, however, it is uninformative (and, frankly, uninteresting) to 
examine reviews in the integer-index format. To view the reviews as natural language, 
we create an index of words as follows, where PAD, START, and UNK are customary for 
representing padding, starting, and unknown tokens, respectively: 

word_index = keras.datasets.imdb.get_word_index()
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["PAD"] = 0
word_index["START"] = 1
word_index["UNK"] = 2

index_word = {v: k for k, v in word_index.items()}
array([[2, 2, 2, 2, 2, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 2,
 173, 2, 256, 2, 2, 100, 2, 838, 112, 50, 670, 2, 2, 2, 480, 284, 2, 150,
 2, 172, 112, 167, 2, 336, 385, 2, 2, 172, 4536, 1111, 2, 546, 2, 2, 447,
 2, 192, 50, 2, 2, 147, 2025, 2, 2, 2, 2, 1920, 4613, 469, 2, 2, 71, 87,
 2, 2, 2, 530, 2, 76, 2, 2, 1247, 2, 2, 2, 515, 2, 2, 2, 626, 2, 2, 2, 62,
 386, 2, 2, 316, 2, 106, 2, 2, 2223, 2, 2, 480, 66, 3785, 2, 2, 130, 2, 2,
 2, 619, 2, 2, 124, 51, 2, 135, 2, 2, 1415, 2, 2, 2, 2, 215, 2, 77, 52, 2,
 2, 407, 2, 82, 2, 2, 2, 107, 117, 2, 2, 256, 2, 2, 2, 3766, 2, 723, 2, 71,
 2, 530, 476, 2, 400, 317, 2, 2, 2, 2, 1029, 2, 104, 88, 2, 381, 2, 297,
 98, 2, 2071, 56, 2, 141, 2, 194, 2, 2, 2, 226, 2, 2, 134, 476, 2, 480,
 2, 144, 2, 2, 2, 51, 2, 2, 224, 92, 2, 104, 2, 226, 65, 2, 2, 1334, 88,
 2, 2, 283, 2, 2, 4472, 113, 103, 2, 2, 2, 2, 2, 178, 2],
 [2, 194, 1153, 194, 2, 78, 228, 2, 2, 1463, 4369, 2, 134, 2, 2, 715,
 2, 118, 1634, 2, 394, 2, 2, 119, 954, 189, 102, 2, 207, 110, 3103, 2,
 2, 69, 188, 2, 2, 2, 2, 2, 249, 126, 93, 2, 114, 2, 2300, 1523, 2, 647,
 2, 116, 2, 2, 2, 2, 229, 2, 340, 1322, 2, 118, 2, 2, 130, 4901, 2, 2, 1002,
 2, 89, 2, 952, 2, 2, 2, 455, 2, 2, 2, 2, 1543, 1905, 398, 2, 1649, 2,
 2, 2, 163, 2, 3215, 2, 2, 1153, 2, 194, 775, 2, 2, 2, 349, 2637, 148, 605,
 2, 2, 2, 123, 125, 68, 2, 2, 2, 349, 165, 4362, 98, 2, 2, 228, 2, 2,
 2, 1157, 2, 299, 120, 2, 120, 174, 2, 220, 175, 136, 50, 2, 4373, 228, 2,
 2, 2, 656, 245, 2350, 2, 2, 2, 131, 152, 491, 2, 2, 2, 2, 1212, 2, 2, 2,
 371, 78, 2, 625, 64, 1382, 2, 2, 168, 145, 2, 2, 1690, 2, 2, 2, 1355, 2,
 2, 2, 52, 154, 462, 2, 89, 78, 285, 2, 145, 95],

Figure 11.12 The first two film reviews from the training dataset of Andrew Maas and 
colleagues' (2011) IMDb dataset. Tokens are in an integer-index format. 


Then we can use the code in Example 11.16 to view the film review of our choice—in 
this case, the first review from the training data. 

Example 11.16 Printing a review as a character string 

' '.join(index_word[id] for id in x_train[0]) 

The resulting string should look identical to the output shown in Figure 11.13. 
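
The same index-shifting and decoding logic can be tried without downloading anything by using a tiny stand-in vocabulary. The raw_index values below are hypothetical toy values, chosen only so that the shifted index 530 decodes to a word; they are not read from the Keras word index.

```python
# Hypothetical stand-in for keras.datasets.imdb.get_word_index()
raw_index = {"the": 1, "and": 2, "brilliant": 527}

# Shift every index by 3 to make room for the three special tokens.
word_index = {k: (v + 3) for k, v in raw_index.items()}
word_index["PAD"] = 0
word_index["START"] = 1
word_index["UNK"] = 2

# Invert the mapping so integers can be turned back into words.
index_word = {v: k for k, v in word_index.items()}

decoded = " ".join(index_word[i] for i in [0, 1, 2, 530, 4])
print(decoded)
```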

Remembering that the review in Figure 11.13 contains the tokens that are fed into 
our neural network, we might nevertheless find it enjoyable to read the full review without 
all of the UNK tokens. In some cases of debugging model results, it might indeed even 
be practical to be able to view the full review. For example, if we’re being too aggressive 
or conservative with either our n_unique_words or n_words_to_skip thresholds, it 
might become apparent by comparing a review like the one in Figure 11.13 with a full 


"UNK UNK UNK UNK UNK brilliant casting location scenery story direction 
everyone's really suited UNK part UNK played UNK UNK could UNK imagine 
being there robert UNK UNK UNK amazing actor UNK now UNK same being director 
UNK father came UNK UNK same scottish island UNK myself UNK UNK loved UNK 
fact there UNK UNK real connection UNK UNK UNK UNK witty remarks throughout 
UNK UNK were great UNK UNK UNK brilliant UNK much UNK UNK bought UNK 
UNK UNK soon UNK UNK UNK released UNK UNK UNK would recommend UNK UNK 
everyone UNK watch UNK UNK fly UNK UNK amazing really cried UNK UNK end UNK 
UNK UNK sad UNK UNK know what UNK say UNK UNK cry UNK UNK UNK UNK must UNK 
been good UNK UNK definitely UNK also UNK UNK UNK two little UNK UNK 
played UNK UNK UNK norman UNK paul UNK were UNK brilliant children UNK often 
left UNK UNK UNK UNK list UNK think because UNK stars UNK play them UNK 
grown up UNK such UNK big UNK UNK UNK whole UNK UNK these children UNK 
amazing UNK should UNK UNK UNK what UNK UNK done don't UNK think UNK whole 
story UNK UNK lovely because UNK UNK true UNK UNK someones life after UNK 
UNK UNK UNK UNK us UNK" 


Figure 11.13 The first film review from the training dataset, now shown as a character string 
"START this film was just brilliant casting location scenery story 
direction everyone's really suited the part they played and you could just 
imagine being there robert redford's is an amazing actor and now the same 
being director norman's father came from the same scottish island as myself 
so i loved the fact there was a real connection with this film the witty 
remarks throughout the film were great it was just brilliant so much that 
i bought the film as soon as it was released for retail and would 
recommend it to everyone to watch and the fly fishing was amazing really cried 
at the end it was so sad and you know what they say if you cry at a film 
it must have been good and this definitely was also congratulations to the 
two little boy's that played the part's of norman and paul they were just 
brilliant children are often left out of the praising list i think 
because the stars that play them all grown up are such a big profile for the 
whole film but these children are amazing and should be praised for what 
they have done don't you think the whole story was so lovely because it was 
true and was someones life after all that was shared with us all" 

Figure 11.14 The first film review from the training dataset, now shown in full as a character string 


one. With our index of words (index_word) already available to us, we simply need to 
download the full reviews: 

(all_x_train,_),(all_x_valid,_) = imdb.load_data() 

Then we modify Example 11.16 to execute join() on the full-review list of our choice 
(i.e., all_x_train or all_x_valid), as provided in Example 11.17. 

Example 11.17 Print full review as character string 

' '.join(index_word[id] for id in all_x_train[0])

Executing this outputs the full text of the review of our choice—again, in this case, the 
first training review—as shown in Figure 11.14. 

Standardizing the Length of the Reviews 

By executing Example 11.15 earlier, we discovered that there is variability in the length 
of the film reviews. In order for the Keras-created TensorFlow model to run, we need 
to specify the size of the inputs that will be flowing into the model during training. This 
enables TensorFlow to optimize the allocation of memory and compute resources. Keras 
provides a convenient pad_sequences() method that enables us to both pad and truncate 
documents of text in a single line. Here we standardize our training and validation data in 
this way, as shown in Example 11.18. 

Example 11.18 Standardizing input length by padding and truncating 

x_train = pad_sequences(x_train, maxlen=max_review_length,
                        padding=pad_type, truncating=trunc_type, value=0)
x_valid = pad_sequences(x_valid, maxlen=max_review_length,
                        padding=pad_type, truncating=trunc_type, value=0)
'PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD 
PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD 
PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD PAD 
UNK begins better than UNK ends funny UNK UNK russian UNK crew 
UNK UNK other actors UNK UNK those scenes where documentary shots UNK UNK 
spoiler part UNK message UNK UNK contrary UNK UNK whole story UNK UNK 
does UNK UNK UNK UNK' 

Figure 11.15 The sixth film review from the training dataset, padded with the PAD 
token at the beginning so that, like all the other reviews, it has a length of 100 tokens 


Now, when printing reviews (e.g., with x_train[0:6]) or their lengths (e.g., with the 
code from Example 11.15), we see that all of the reviews have the same length of 100 
(because we set max_review_length = 100). Examining x_train[5] (which previously 
had a length of only 43 tokens) with code similar to Example 11.16, we can observe that 
the beginning of the review has been padded with 57 PAD tokens (see Figure 11.15). 
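
For intuition, the padding and truncating just described can be mimicked for a single review in a few lines of plain Python. This is a simplified sketch rather than the Keras pad_sequences() implementation, and the function name pad_or_truncate is hypothetical; it assumes that 'pre' truncation discards tokens from the start of a review and that 'pre' padding adds zeros at the start.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Return a copy of seq with exactly maxlen items."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' keeps the final maxlen tokens; 'post' keeps the first maxlen.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return pad + seq if padding == "pre" else seq + pad

short_review = list(range(1, 44))     # 43 tokens, like the sixth training review
long_review = list(range(1, 551))     # 550 tokens

padded = pad_or_truncate(short_review, maxlen=100)
truncated = pad_or_truncate(long_review, maxlen=100)
print(len(padded), padded.count(0), len(truncated))
```

With pre-truncation, a long review keeps its closing tokens, which is often where a reviewer states an overall verdict.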

Dense Network 

With sufficient NLP theory behind us, as well as our data loaded and preprocessed, 
we’re at long last prepared to make use of a neural network architecture to classify film 
reviews by their sentiment. A baseline dense network model for this task is shown in 
Example 11.19. 

Example 11.19 Dense sentiment classifier architecture 

model = Sequential()
model.add(Embedding(n_unique_words, n_dim,
                    input_length=max_review_length))
model.add(Flatten())
model.add(Dense(n_dense, activation='relu'))
model.add(Dropout(dropout))
# model.add(Dense(n_dense, activation='relu'))
# model.add(Dropout(dropout))
model.add(Dense(1, activation='sigmoid'))

Let’s break the architecture down line by line: 

■ We’re using a Keras Sequential() method to invoke a sequential model, as we 
have for all of the models so far in this book. 

■ As with word2vec, the Embedding() layer enables us to create word vectors from a 
corpus of documents (in this case, the 25,000 movie reviews of the IMDb training 
dataset). Relative to independently creating word vectors with word2vec (or GloVe, 
etc.) as we did earlier in this chapter, training your word vectors via backpropagation 
as a component of your broader NLP model has a potential advantage: The 
locations that words are assigned to within the vector space reflect not only word 
similarity but also the relevance of the words to the ultimate, specific purpose of 
the model (e.g., binary classification of IMDb reviews by sentiment). The size of 
the word-vector vocabulary and the number of dimensions of the vector space are 
specified by n_unique_words and n_dim, respectively. Because the embedding layer 




is the first hidden layer in our network, we must also pass into it the shape of our 
input layer: We do this with the input_length argument. 

■ As in Chapter 10, the Flatten() layer enables us to pass a many-dimensional 
output (here, a two-dimensional output from the embedding layer) into a one- 
dimensional dense layer. 

■ Speaking of Dense() layers, we used a single one consisting of relu activations in 
this architecture, with Dropout() applied to it. 

■ We opted for a fairly shallow neural network architecture for our baseline model, 
but you can trivially deepen it by adding further Dense() layers (see the lines that 
are commented out). 

■ Finally, because there are only two classes to classify, we require only a single output 
neuron (because, as discussed earlier in this chapter, if one class has the probability p 
then the other class has the probability 1 - p). This neuron is sigmoid because we’d 
like it to output probabilities between 0 and 1 (refer to Figure 6.9). 
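
To make the flow of data through these layers concrete, the following toy forward pass traces one review from integer indices to a single probability. It is a sketch under stated assumptions: all sizes are shrunk, the weights are random rather than trained, dropout is omitted (as it would be at inference time), and none of the variable names come from Keras.

```python
import math
import random

random.seed(42)
n_words, n_dim, review_len, n_dense = 10, 3, 4, 2    # toy sizes

# Randomly initialized parameters (a trained model would have learned these).
embedding = [[random.uniform(-1, 1) for _ in range(n_dim)] for _ in range(n_words)]
w_dense = [[random.uniform(-1, 1) for _ in range(review_len * n_dim)]
           for _ in range(n_dense)]
b_dense = [0.0] * n_dense
w_out = [random.uniform(-1, 1) for _ in range(n_dense)]
b_out = 0.0

review = [0, 2, 7, 5]                                  # integer-indexed tokens
vectors = [embedding[tok] for tok in review]           # embedding lookup: 4 x 3
flat = [x for vec in vectors for x in vec]             # flatten: 12 values
hidden = [max(0.0, sum(w * x for w, x in zip(row, flat)) + b)
          for row, b in zip(w_dense, b_dense)]         # dense layer with relu
z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
p_positive = 1.0 / (1.0 + math.exp(-z))                # sigmoid output neuron
p_negative = 1.0 - p_positive                          # the other class
print(len(flat), round(p_positive, 3), round(p_negative, 3))
```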



In addition to training word vectors on natural language data alone (e.g., with word2vec 
or GloVe) or training them with an embedding layer as part of a deep learning model, 
pretrained word vectors are also available online. 

As with using a ConvNet trained on the millions of images in ImageNet 
(Chapter 10), this natural language transfer learning is powerful, because these word 
vectors may have been trained on extremely large corpuses (e.g., all of Wikipedia, or 
the English-language Internet) that provide large, nuanced vocabularies that would be 
expensive to train yourself. Examples of pretrained word vectors are available at 
github.com/Kyubyong/wordvectors and nlp.stanford.edu/projects/glove. The fastText 
library also offers subword embeddings in 157 languages; these can be downloaded 
from fasttext.cc. 

In this book, we don’t cover substituting pretrained word vectors (be they downloaded 
or trained separately from your deep learning model, as we did with Word2Vec() 
earlier in this chapter) in place of the embedding layer, because there are many different 
permutations on how you might like to do this. For a neat tutorial from François 
Chollet, the creator of Keras, go to bit.ly/preTrained. 


Executing model.summary(), we discover that our fairly simple NLP model has quite 
a few parameters, as shown in Figure 11.16: 

■ In the embedding layer, the 320,000 parameters come from having 5,000 words, 
each one with a location specified in a 64-dimensional word-vector space 
(64 x 5,000 = 320,000). 

■ Flowing out of the embedding layer through the flatten layer and into the dense 
layer are 6,400 values: Each of our film-review inputs consists of 100 tokens, with 
each token specified by 64 word-vector-space coordinates (64 x 100 = 6,400). 

■ Each of the 64 neurons in the dense hidden layer receives input from each of the 
6,400 values flowing out of the flatten layer, for a total of 64 x 6,400 = 409,600 
Layer (type)                 Output Shape         Param #
embedding_1 (Embedding)      (None, 100, 64)      320000
flatten_1 (Flatten)          (None, 6400)         0
dense_1 (Dense)              (None, 64)           409664
dropout_1 (Dropout)          (None, 64)           0
dense_2 (Dense)              (None, 1)            65

Total params: 729,729
Trainable params: 729,729
Non-trainable params: 0

Figure 11.16 Dense sentiment classifier model summary 


weights. And, of course, each of the 64 neurons has a bias, for a total of 409,664 
parameters in the layer. 

■ Finally, the single neuron of the output layer has 64 weights—one for the activation 
output by each of the neurons in the preceding layer—plus its bias, for a total of 65 
parameters. 

■ Summing up the parameters from each of the layers, we have a grand total of 
729,729 of them (roughly 730,000). 
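
The parameter counts in the bullets above can be reproduced with a few lines of arithmetic. The variable names mirror the hyperparameters used in this chapter, with the values of 5,000 words, 64 dimensions, 100 tokens, and 64 dense neurons taken from the discussion above.

```python
n_unique_words, n_dim = 5000, 64
max_review_length, n_dense = 100, 64

embedding_params = n_unique_words * n_dim             # one 64-d vector per word
flatten_outputs = max_review_length * n_dim           # values entering the dense layer
dense_params = flatten_outputs * n_dense + n_dense    # weights plus one bias per neuron
output_params = n_dense * 1 + 1                       # 64 weights plus a bias
total_params = embedding_params + dense_params + output_params

print(embedding_params, dense_params, output_params, total_params)
```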

As shown in Example 11.20, we compile our dense sentiment classifier with a line 
of code that should already be familiar from recent chapters, except that, because we 
have a single output neuron within a binary classifier, we use binary_crossentropy 
cost in place of the categorical_crossentropy cost we used for our multiclass MNIST 
classifiers. 

Example 11.20 Compiling our sentiment classifier 

model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
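
As a reminder of what this cost measures for a single review, here is a minimal sketch of the standard binary cross-entropy formula, -(y ln ŷ + (1 - y) ln(1 - ŷ)). The function below is written for illustration and is not the Keras implementation; the example probability of 0.09 is an illustrative value.

```python
import math

def binary_crossentropy(y, y_hat):
    """Cost for one example with true label y (0 or 1) and predicted probability y_hat."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# A confident prediction of "negative" (y_hat = 0.09) ...
cost_if_negative = binary_crossentropy(0, 0.09)   # ... costs little if the review is negative
cost_if_positive = binary_crossentropy(1, 0.09)   # ... and a lot if it is actually positive
print(round(cost_if_negative, 4), round(cost_if_positive, 4))
```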

With the code provided in Example 11.21, we create a ModelCheckpoint() object 
that will allow us to save our model parameters after each epoch during training. By 
doing this, we can return to the parameters from our epoch of choice later on during 
model evaluation or to make inferences in a production system. If the output_dir directory 
doesn’t already exist, we use the makedirs() method to make it. 

Example 11.21 Creating an object and directory for checkpointing model parameters 
after each epoch 

modelcheckpoint = ModelCheckpoint(filepath=output_dir+
                                  "/weights.{epoch:02d}.hdf5")
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
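
The {epoch:02d} placeholder in the file path is an ordinary Python format field, so the resulting file names can be previewed without training anything. In this sketch the directory name 'model_output/dense' is an assumption made for illustration, and epochs are numbered from 1.

```python
output_dir = "model_output/dense"     # assumed directory name, for illustration only
template = output_dir + "/weights.{epoch:02d}.hdf5"

# Preview the checkpoint file name for each of four epochs.
checkpoint_names = [template.format(epoch=e) for e in range(1, 5)]
for name in checkpoint_names:
    print(name)
```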












Train on 25000 samples, validate on 25000 samples
Epoch 1/4
25000/25000 [==============================] - 2s 80us/step - loss: 0.5612 - acc: 0.6892 - val_loss: 0.3630 - val_acc: 0.8398
Epoch 2/4
25000/25000 [==============================] - 2s 69us/step - loss: 0.2851 - acc: 0.8841 - val_loss: 0.3486 - val_acc: 0.8447
Epoch 3/4
25000/25000 [==============================] - 2s 70us/step - loss: 0.1158 - acc: 0.9646 - val_loss: 0.4252 - val_acc: 0.8337
Epoch 4/4
25000/25000 [==============================] - 2s 70us/step - loss: 0.0237 - acc: 0.9961 - val_loss: 0.5304 - val_acc: 0.8340

Figure 11.17 Training the dense sentiment classifier 

Like the compile step, the model-fitting step (Example 11.22) for our sentiment classifier 
should be familiar except, perhaps, for our use of the callbacks argument to pass in 
the modelcheckpoint object. 35 

Example 11.22 Fitting our sentiment classifier 

model.fit(x_train, y_train,
          batch_size=batch_size, epochs=epochs, verbose=1,
          validation_data=(x_valid, y_valid),
          callbacks=[modelcheckpoint])

As shown in Figure 11.17, we achieve our lowest validation loss (0.349) and highest validation 
accuracy (84.5 percent) in the second epoch. In the third and fourth epochs, the 
model is heavily overfit, with accuracy on the training set considerably higher than on 
the validation set. By the fourth epoch, training accuracy stands at 99.6 percent while 
validation accuracy is much lower, at 83.4 percent. 
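
Picking the best epoch and quantifying the overfitting can be done programmatically. The sketch below hard-codes the values reported in Figure 11.17 into a dictionary named history, which is a hypothetical stand-in here rather than the object returned by model.fit().

```python
# Per-epoch metrics copied from Figure 11.17
history = {
    "acc":      [0.6892, 0.8841, 0.9646, 0.9961],
    "val_loss": [0.3630, 0.3486, 0.4252, 0.5304],
    "val_acc":  [0.8398, 0.8447, 0.8337, 0.8340],
}

# Epoch (counting from 1) with the lowest validation loss.
val_loss = history["val_loss"]
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Gap between training and validation accuracy: a growing gap signals overfitting.
gaps = [round(a - v, 4) for a, v in zip(history["acc"], history["val_acc"])]
print(best_epoch, gaps)
```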

To evaluate the results of the best epoch more thoroughly, we use the Keras load_weights() 
method to load the parameters from the second epoch (weights.02.hdf5) 
back into our model, as in Example 11.23. 36, 37 

Example 11.23 Loading model parameters 

model.load_weights(output_dir+"/weights.02.hdf5")

We can then calculate validation set ŷ values for the best epoch by calling the 
predict_proba() method on the x_valid dataset, as shown in Example 11.24. 

Example 11.24 Predicting ŷ for all validation data 

y_hat = model.predict_proba(x_valid) 


35. This isn’t our first use of the callbacks argument. We previously used this argument, which can take in a list 
of multiple different callbacks, to provide data on model training progress to TensorBoard (see Chapter 9). 

36. Although the method is called load_weights(), it loads in all model parameters, including biases. Because 
weights typically constitute the vast majority of parameters in a model, deep learning practitioners often call 
parameter files “weights” files. 

37. Earlier versions of Keras used zero indexing for epochs, but more recent versions index starting at 1. 
With y_hat[0], for example, we can now see the model’s prediction of the sentiment of 
the first movie review in the validation set. For this review, ŷ = 0.09, indicating the 
model estimates that there’s a 9 percent chance the review is positive and, therefore, a 91 
percent chance it’s negative. Executing y_valid[0] informs us that y = 0 for this 
review (that is, it is in fact a negative review), so the model’s ŷ is pretty good! If you’re 
curious about what the content of the negative review was, you can run a slight modification 
on Example 11.17 to access the full text of the all_x_valid[0] list item, as shown 
in Example 11.25. 

Example 11.25 Printing a full validation review 

' '.join(index_word[id] for id in all_x_valid[0])

Examining individual scores can be interesting, but we get a much better sense of our 
model’s performance by looking at all of the validation results together. We can plot a 
histogram of all the validation ŷ values by running the code in Example 11.26. 

Example 11.26 Plotting a histogram of validation data ŷ values 

plt.hist(y_hat)
_ = plt.axvline(x=0.5, color='orange')

The histogram output is provided in Figure 11.18. The plot shows that the model often 
has a strong opinion on the sentiment of a given review: Some 8,000 of the 25,000 reviews 
(~32 percent of them) are assigned a ŷ of less than 0.1, and ~6,500 (~26 percent) 
are given a ŷ greater than 0.9. 

The vertical orange line in Figure 11.18 marks the 0.5 threshold above which reviews 
are considered by a simple accuracy calculation to be positive. As discussed earlier in the 
chapter, such a simple threshold can be misleading, because a review with a ŷ just below 
0.5 is not predicted by the model to have much difference in sentiment relative to 
a review with a ŷ just above 0.5. To obtain a more nuanced assessment of our model’s 
performance as a binary classifier, we can use the roc_auc_score() method from the 



Figure 11.18 Histogram of validation data ŷ values for the second epoch of our dense sentiment classifier 
scikit-learn metrics library to straightforwardly calculate the ROC AUC score across the 
validation data, as shown in Example 11.27. 

Example 11.27 Calculating ROC AUC for validation data 

pct_auc = roc_auc_score(y_valid, y_hat)*100.0
"{:0.2f}".format(pct_auc)

Printing the output in an easy-to-read format with the format () method, we see that the 
percentage of the area under the receiver operating characteristic curve is (a fairly high) 
92.9 percent. 
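
To see why ROC AUC does not depend on a 0.5 cutoff, it helps to compute it by hand on a small sample. The sketch below uses the equivalent pairwise definition (the fraction of positive-negative pairs in which the positive review receives the higher score) rather than the scikit-learn implementation, and applies it to the ten rows displayed in Figure 11.19; the function name roc_auc is our own.

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# The ten validation rows displayed in Figure 11.19
y = [0, 1, 1, 0, 1, 1, 1, 0, 0, 1]
y_hat = [0.089684, 0.982754, 0.746905, 0.543328, 0.997054,
         0.833994, 0.766254, 0.008032, 0.812743, 0.729463]

auc = roc_auc(y, y_hat)
accuracy = sum((s > 0.5) == bool(t) for t, s in zip(y, y_hat)) / len(y)
print(auc, accuracy)
```

Accuracy at the 0.5 threshold only registers that two negative reviews landed on the wrong side, whereas the pairwise view also reflects how far out of order they are relative to the positive reviews.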

To get a sense of where the model breaks down, we can create a DataFrame of y and 
ŷ validation set values, using the code in Example 11.28. 

Example 11.28 Creating a ydf DataFrame of y and ŷ values 

float_y_hat = []
for y in y_hat:
    float_y_hat.append(y[0])
ydf = pd.DataFrame(list(zip(float_y_hat, y_valid)),
                   columns=['y_hat', 'y'])

Printing the first 10 rows of the resulting ydf DataFrame with ydf.head(10), we see the 
output shown in Figure 11.19. 

Querying the ydf DataFrame as we do in Examples 11.29 and 11.30 and then 
examining the individual reviews these queries surface by varying the list index in 


      y_hat  y
0  0.089684  0
1  0.982754  1
2  0.746905  1
3  0.543328  0
4  0.997054  1
5  0.833994  1
6  0.766254  1
7  0.008032  0
8  0.812743  0
9  0.729463  1

Figure 11.19 DataFrame of y and ŷ values for the IMDb validation data 
"START wow another kevin costner hero movie postman tin cup waterworld 
bodyguard wyatt earp robin hood even that baseball movie seems like he 
makes movies specifically to be the center of attention the characters are 
almost always the same the heroics the flaws the greatness the fall the 
redemption yup within the 1st 5 minutes of the movie we're all supposed to 
be in awe of his character and it builds up more and more from there br br 
and this time the story story is just a collage of different movies you 
don't need a spoiler you've seen this movie several times though it had 
different titles you'll know what will happen way before it happens this is 
like mixing an officer and a gentleman with but both are easily better 
movies watch to see how this kind of movie should be made and also to see 
how an good but slightly underrated actor russell plays the hero" 

Figure 11.20 An example of a false positive: This negative review was misclassified as positive by our model. 


Example 11.25, you can get a sense of the kinds of reviews that cause the model to 
make its largest errors. 

Example 11.29 Ten cases of negative validation reviews with high ŷ scores 

ydf[(ydf.y == 0) & (ydf.y_hat > 0.9)].head(10)

Example 11.30 Ten cases of positive validation reviews with low ŷ scores 

ydf[(ydf.y == 1) & (ydf.y_hat < 0.1)].head(10)

An example of a false positive, a negative review (y = 0) with a very high model 
score (ŷ = 0.97), that was identified by running the code in Example 11.29 is provided 
in Figure 11.20. 38 And an example of a false negative, a positive review (y = 1) with 
a very low model score (ŷ = 0.06), that was identified by running the code in 
Example 11.30 is provided in Figure 11.21. 39 Carrying out this kind of post hoc analysis 
of our model, one potential shortcoming that surfaces is that our dense classifier is 
not specialized to detect patterns of multiple tokens occurring in a sequence that might 
predict film-review sentiment. For example, it might be handy for patterns like the 
token-pair not-good to be easily detected by the model as predictive of negative sentiment. 

Convolutional Networks 

As covered in Chapter 10, convolutional layers are particularly adept at detecting spatial 
patterns. In this section, we use them to detect spatial patterns among words (like 
the not-good sequence) and see whether they can improve upon the performance of 
our dense network at classifying film reviews by their sentiment. All of the code for this 
ConvNet can be found in our Convolutional Sentiment Classifier notebook. 


38. We output this particular review—the 387th in the validation dataset—by running the following code: 
' '.join(index_word[id] for id in all_x_valid[386]). 

39. Run ' '.join(index_word[id] for id in all_x_valid[224]) to print out this same review yourself. 
"START finally a true horror movie this is the first time in years that i 
had to cover my eyes i am a horror buff and i recommend this movie but it 
is quite gory i am not a big wrestling fan but kane really pulled the 
whole monster thing off i have to admit that i didn't want to see this movie 
my 17 year old dragged me to it but am very glad i did during and after 
the movie i was looking over my shoulder i have to agree with others about 
the whole remake horror movies enough is enough i think that is why this 
movie is getting some good reviews it is a refreshing change and takes 
you back to the texas chainsaw first one michael myers and jason and no cgi 
crap" 

Figure 11.21 An example of a false negative: This positive review was misclassified as negative by our model. 

The dependencies for this model are identical to those of our dense sentiment classifier 
(see Example 11.12), except that it has three new Keras layer types, as provided in 
Example 11.31. 

Example 11.31 Additional CNN dependencies 

from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.layers import SpatialDropout1D

The hyperparameters for our convolutional sentiment classifier are provided in 
Example 11.32. 

Example 11.32 Convolutional sentiment classifier hyperparameters 

# output directory name: 
output_dir = 'model_output/conv' 

# training: 

epochs = 4 
batch_size = 128 

# vector-space embedding: 
n_dim = 64 

n_unique_words = 5000 
max_review_length = 400 
pad_type = trunc_type = 'pre' 
drop_embed = 0.2 # new! 

# convolutional layer architecture:
n_conv = 256 # filters, a.k.a. kernels
k_conv = 3 # kernel length

# dense layer architecture:
n_dense = 256
dropout = 0.2
Relative to the hyperparameters from our dense sentiment classifier (see Example 11.13): 

■ We have a new, unique directory name ('conv') for storing model parameters after 
each epoch of training. 

■ Our number of epochs and batch size remain the same. 

■ Our vector-space embedding hyperparameters remain the same, except that 

■ We quadrupled max_review_length to 400. We did this because, despite 
the fairly dramatic increase in input volume as well as an increase in our 
number of hidden layers, our convolutional classifier will still have far fewer 
parameters relative to our dense sentiment classifier. 

■ With drop_embed, we’ll be adding dropout to our embedding layer. 

■ Our convolutional sentiment classifier will have two hidden layers after the embedding 
layer: 

■ A convolutional layer with 256 filters (n_conv), each with a single dimension 
(a length) of 3 (k_conv). When working with two-dimensional images in 
Chapter 10, our convolutional layers had filters with two dimensions. Natural 
language—be it written or spoken—has only one dimension associated with 
it (the dimension of time) and so the convolutional layers used in this chapter 
will have one-dimensional filters. 

■ A dense layer with 256 neurons (n_dense) and dropout of 20 percent. 
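
The claim that this larger-input model still has fewer parameters can be checked with some quick arithmetic. This sketch assumes the standard counts for each layer type: a one-dimensional convolutional filter has one weight per kernel position per input dimension plus a bias, and the global max-pooling layer passes one value per filter to the dense layer.

```python
n_unique_words, n_dim, max_review_length = 5000, 64, 400
n_conv, k_conv, n_dense = 256, 3, 256

embedding_params = n_unique_words * n_dim               # unchanged from the dense model
conv_params = (k_conv * n_dim + 1) * n_conv             # 3 x 64 weights + 1 bias, per filter
conv_output_length = max_review_length - k_conv + 1     # positions the filter can occupy
dense_params = n_conv * n_dense + n_dense               # one input per filter after pooling
output_params = n_dense + 1
total_params = embedding_params + conv_params + dense_params + output_params

print(conv_output_length, conv_params, dense_params, total_params)
```

The saving comes from the pooling step: the dense layer sees 256 inputs rather than every position of every activation map.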

The steps for loading the IMDb data and standardizing the length of the reviews 
are identical to those in our Dense Sentiment Classifier notebook (see Examples 11.14 
and 11.18). The model architecture is of course rather different, and is provided in 
Example 11.33. 

Example 11.33 Convolutional sentiment classifier architecture 

model = Sequential()

# vector-space embedding:
model.add(Embedding(n_unique_words, n_dim,
                    input_length=max_review_length))
model.add(SpatialDropout1D(drop_embed))

# convolutional layer:
model.add(Conv1D(n_conv, k_conv, activation='relu'))
# model.add(Conv1D(n_conv, k_conv, activation='relu'))
model.add(GlobalMaxPooling1D())

# dense layer:
model.add(Dense(n_dense, activation='relu'))
model.add(Dropout(dropout))

# output layer:
model.add(Dense(1, activation='sigmoid'))
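
To make the convolution and global max-pooling steps of Example 11.33 concrete, here is a toy version with a single hand-written filter. It is a sketch only: the function conv1d_relu, the two-dimensional "word vectors," and the filter weights are all hypothetical, whereas the real layer learns 256 filters over 64-dimensional vectors.

```python
def conv1d_relu(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of token vectors (no padding, stride 1)."""
    k = len(kernel)
    activations = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        z = bias + sum(w * x
                       for w_vec, x_vec in zip(kernel, window)
                       for w, x in zip(w_vec, x_vec))
        activations.append(max(0.0, z))            # relu
    return activations

# Toy two-dimensional vectors for a six-token review (hypothetical values)
review = [[0.1, 0.0], [0.9, 0.2], [0.8, 0.9], [0.0, 0.1], [0.2, 0.0], [0.1, 0.1]]
# One filter spanning three tokens: it responds to a large first coordinate in
# the middle token followed by a large second coordinate in the next token.
kernel = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

activation_map = conv1d_relu(review, kernel)       # length 6 - 3 + 1 = 4
pooled = max(activation_map)                       # global max-pooling keeps one value
print(activation_map, pooled)
```

The pooled value records that the pattern occurred somewhere in the review, but not where.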
Breaking the model down: 

■ Our embedding layer is the same as before, except that it now has dropout applied 
to it. 

■ We no longer require Flatten(), because the Conv1D() layer takes in both dimensions 
of the embedding layer output. 

■ We use relu activation within our one-dimensional convolutional layer. The layer 
has 256 unique filters, each of which is free to specialize in activating when it passes 
over a particular three-token sequence. The activation map for each of the 256 
filters has a length of 398, for a 256 x 398 output shape. 40 

■ If you fancy it, you’re welcome to add additional convolutional layers, by, for 
example, uncommenting the second Conv1D() line. 

■ Global max-pooling is common for dimensionality reduction within deep learning 
NLP models. We use it here to squash the activation map from 256 x 398 to 
256 x 1. By applying it, only the magnitude of the largest activation for a given 
convolutional filter is retained by the maximum-calculating operation, and we 
lose any temporal-position-specific information the filter may have output to its 
398-element-long activation map. 

■ Because the activations output from the global max-pooling layer are one-dimensional, 
they can be fed directly into the dense layer, which consists (again) 
of relu neurons with dropout applied. 

■ The output layer remains the same. 

■ The model has a grand total of 435,000 parameters (see Figure 11.22), several hun- 
dred thousand fewer than our dense sentiment classifier. Per epoch, this model will 


Layer (type)                    Output Shape         Param #
embedding_1 (Embedding)         (None, 400, 64)      320000
spatial_dropout1d_1 (Spatial    (None, 400, 64)      0
conv1d_1 (Conv1D)               (None, 398, 256)     49408
global_max_pooling1d_1 (Glob    (None, 256)          0
dense_1 (Dense)                 (None, 256)          65792
dropout_1 (Dropout)             (None, 256)          0
dense_2 (Dense)                 (None, 1)            257

Total params: 435,457
Trainable params: 435,457
Non-trainable params: 0

Figure 11.22 Convolutional sentiment classifier model summary 


40. As described in Chapter 10, when a two-dimensional filter convolves over an image, we lose pixels around
the perimeter if we don’t pad the image first. In this natural language model, our one-dimensional convolutional
filter has a length of three, so, on the far left of the movie review, it begins centered on the second token and,
on the far right, it ends centered on the second-to-last token. Because we didn’t pad the movie reviews at both
ends before feeding them into the convolutional layer, we thus lose a token’s worth of information from each end:
400 - 1 - 1 = 398. We’re not upset about this loss.

nevertheless take longer to train because the convolutional operation is relatively
computationally expensive.
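
The shapes and parameter counts in Figure 11.22 can be checked by hand. The following is a minimal sketch in plain Python, not part of the notebook; the vocabulary size of 5,000 is an assumption inferred from the 320,000 embedding parameters divided by the 64 embedding dimensions, and conv1d_output_length is a hypothetical helper name.

```python
# Values read from the model summary in Figure 11.22.
n_unique_words = 5000      # assumption: inferred from 320,000 / 64
n_dim = 64                 # word-vector dimensions
max_review_length = 400    # tokens per review after padding/truncating
n_conv = 256               # number of convolutional filters
k_conv = 3                 # filter length in tokens
n_dense = 256              # neurons in the dense layer


def conv1d_output_length(input_length, kernel_size):
    """Activation-map length with no padding and a stride of 1."""
    return input_length - kernel_size + 1


# Each count is (number of weights) + (number of biases).
embedding_params = n_unique_words * n_dim            # no biases in an embedding
conv_params = k_conv * n_dim * n_conv + n_conv       # every filter spans all 64 dims
dense_params = n_conv * n_dense + n_dense            # 256 pooled values into 256 neurons
output_params = n_dense * 1 + 1                      # single sigmoid neuron

total_params = embedding_params + conv_params + dense_params + output_params

print(conv1d_output_length(max_review_length, k_conv))
print(embedding_params, conv_params, dense_params, output_params, total_params)
```

The dropout and global max-pooling layers contribute no parameters, which is why they show a count of zero in the summary.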

A critical item to note about this model architecture is that the convolutional filters
are not detecting simply triplets of words. Rather, they are detecting triplets of word
vectors. Following from our discussion in Chapter 2, contrasting discrete, one-hot word
representations with the word-vector representations that gently smear meaning across a
high-dimensional space (see Table 2.1), all of the models in this chapter become
specialized in associating word meaning with review sentiment, as opposed to merely
associating individual words with review sentiment. As an example, if the network learns
that the token pair not-good is associated with a negative review, then it should also
associate the pair not-great with negative reviews, because good and great have similar
meanings (and thus should occupy a similar location in word-vector space).

The compile, checkpoint, and model-fitting steps are the same as for our dense
sentiment classifier (see Examples 11.20, 11.21, and 11.22, respectively). Model-fitting
progress is shown in Figure 11.23. The epoch with the lowest validation loss
(0.258) and highest validation accuracy (89.6 percent) was the third epoch. Loading
the model parameters from that epoch back in (with the code from Example 11.23 but
specifying weights.03.hdf5), we then predict ŷ for all validation data (exactly as in
Example 11.24). Creating a histogram (Figure 11.24) of these ŷ values (with the same
code as in Example 11.26), we can see visually that our CNN has a stronger opinion of
review sentiment than our dense network did (refer to Figure 11.18): There are about a
thousand more reviews with ŷ < 0.1 and several thousand more with ŷ > 0.9. Calculating
ROC AUC (with the code from Example 11.27), we output a very high score of
96.12 percent, indicating that the CNN’s confidence was not misplaced: It is a marked
improvement over the already high ~93 percent score of the dense net.


Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
25000/25000 [==============================] - 41s 2ms/step - loss: 0.1151 - acc: 0.9589 - val_loss: 0.2828 - val_acc: 0.8934


Figure 11.23 Training the convolutional sentiment classifier 



Figure 11.24 Histogram of validation data ŷ values for the third epoch of our
convolutional sentiment classifier

Networks Designed for Sequential Data

Our ConvNet classifier outperformed our dense net, perhaps in large part because its
convolutional layer is adept at learning patterns of words that predict some outcome, such
as whether a film review is favorable or negative. The filters within convolutional layers
tend to excel at learning short sequences like triplets of words (recall that we set k = 3 in
Example 11.32), but a document of natural language like a movie review might contain
much longer sequences of words that, when considered all together, would enable the
model to accurately predict some outcome. To handle long sequences of data like this,
there exists a family of deep learning models called recurrent neural networks (RNNs),
which include specialized layer types like long short-term memory units (LSTMs) and gated
recurrent units (GRUs). In this section, we cover the essential theory of RNNs and apply
several variants of them to our movie-review classification problem. We also introduce
attention, an especially sophisticated approach to modeling natural language data that is
setting new benchmarks across NLP applications.


As mentioned at the start of the chapter, the RNN family, including LSTMs and 
GRUs, is well suited to handling not only natural language data but also any input data 
that occur in a one-dimensional sequence. This includes price data (e.g., financial time 
series, stock prices), sales figures, temperatures, and disease rates (epidemiology). While 
RNN applications other than NLP are beyond the scope of this textbook, we collate 
resources for modeling quantitative data over time at jonkrohn.com/resources under 
the heading Time Series Prediction. 


Recurrent Neural Networks 

Consider the following sentences: 

Jon and Grant are writing a book together. They have really enjoyed writing it. 

The human mind can track the concepts in the second sentence quite easily. You already
know that “they” in the second sentence refers to your authors, and “it” refers to the
book we’re writing. Although this task is easy for you, however, it is not so trivial for a
neural network.

The convolutional sentiment classifier we built in the previous section was able to
consider a word only in the context of its two immediate neighbors, one on either side
of it (k_conv = 3, as in Example 11.32). With such a small window of text, that neural
network had no capacity to assess what “they” or “it” might be referring to. Our human
brains can do it because our thoughts loop around each other, and we revisit earlier ideas
in order to inform our understanding of the current context. In this section we introduce
the concept of recurrent neural networks, which set out to do just that: They have loops
built into their structure that allow information to persist over time.

The high-level structure of a recurrent neural network (RNN) is shown in
Figure 11.25. On the left, the purple line indicates the loop that passes information

Figure 11.25 Schematic diagram of a recurrent neural network (left panel: recurrent neural network; right panel: RNN unpacked)


between steps in the network. As in a dense network, where there is a neuron for each
input, so too is there a neuron for each input here. We can observe this more easily on
the right, where the schematic of the RNN is unpacked. There is a recurrent module for
each word in the sentence (only the first four words are shown here for brevity). 41
However, each module receives an additional input from the previous module, and in doing
so the network is able to pass along information from earlier timesteps in the sequence.
In the case of Figure 11.25, each word is represented by a distinct timestep in the RNN
sequence, so the network might be able to learn that “Jon” and “Grant” were writing
the book, thereby associating these terms with the word “they” that occurs later in the
sequence.
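
The unpacked view described above can be sketched in a few lines of plain Python. This is an illustration only, assuming a single recurrent unit with scalar inputs and a tanh activation; the function name simple_rnn_forward and the weight values are hypothetical, chosen to make the behavior easy to see.

```python
import math


def simple_rnn_forward(inputs, w_x, w_h, b):
    """Run a one-unit recurrent layer over a sequence of scalar inputs.

    Returns the hidden state produced at every timestep.
    """
    h = 0.0  # initial hidden state, before any input has been seen
    hidden_states = []
    for x in inputs:
        # Each module combines the current input with the previous hidden state.
        h = math.tanh(w_x * x + w_h * h + b)
        hidden_states.append(h)
    return hidden_states


# A signal appears only at the first timestep...
sequence = [1.0, 0.0, 0.0, 0.0]

# ...yet with a recurrent weight it is still present (though fading) at later steps.
with_loop = simple_rnn_forward(sequence, w_x=1.0, w_h=0.5, b=0.0)

# With the loop removed (w_h = 0), later timesteps know nothing about it.
without_loop = simple_rnn_forward(sequence, w_x=1.0, w_h=0.0, b=0.0)

print(with_loop)
print(without_loop)
```

The fading values in the first result also hint at why information from early timesteps is harder for a simple RNN to retain than information from recent ones.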

Recurrent neural networks are, computationally, more complex to train than exclusively
“feedforward” neural networks like the dense nets and CNNs we’ve used so far
in the book. As depicted in Figure 8.6, feedforward networks involve backpropagating
cost from the output layer back toward the input layer. If a network includes a recurrent
layer (such as SimpleRNN, LSTM, or GRU), then the cost must be backpropagated not only
back toward the input layer, but back over the timesteps of the recurrent layer (from later
timesteps back toward earlier timesteps), as well. Note that, in the same way that the
gradient of learning vanishes as we backpropagate over later hidden layers toward earlier ones
(see Figure 8.8), so, too, does the gradient vanish as we backpropagate over later timesteps
within a recurrent layer toward earlier ones. Because of this, later timesteps in a sequence
have more influence within the model than earlier ones do. 42
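
To see how quickly the gradient can shrink over timesteps, here is a rough sketch in plain Python. It assumes a single tanh recurrent unit with hypothetical weights, and it tracks only how strongly the final hidden state depends on each earlier hidden state; a real layer has many units and weight matrices, so the numbers are illustrative rather than representative.

```python
import math


def gradient_through_time(inputs, w_x, w_h):
    """Return d(h_final)/d(h_t) for every timestep t of a one-unit tanh RNN."""
    # Forward pass, storing every hidden state.
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)

    # Backward pass: chain together the local derivatives, last step first.
    grads = [1.0] * len(states)
    for t in range(len(states) - 2, -1, -1):
        # Derivative of h_{t+1} with respect to h_t for a tanh unit.
        local = w_h * (1.0 - states[t + 1] ** 2)
        grads[t] = grads[t + 1] * local
    return grads


grads = gradient_through_time([0.5] * 20, w_x=1.0, w_h=0.5)
for t in (19, 15, 10, 5, 0):
    print(t, grads[t])
```

Because each backward step multiplies by a factor smaller than one in this setup, the influence of the earliest timesteps on the final state becomes vanishingly small, which is the effect the paragraph above describes.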


41. This is also why we have to pad shorter sentences during preprocessing: The RNN expects a sequence of a 
particular length, and so if the sequence is not long enough we add PAD tokens to make up the difference. 

42. If you suspect that the beginning of your sequences (e.g., the words at the beginning of a movie review) is
generally more relevant to the problem you’re solving with your model (sentiment classification) than the end (the
words at the end of the review), you can reverse the sequence before passing it as an input into your network.
In that way, within your network’s recurrent layers, the beginning of the sequence will be backpropagated over
before the end is.









Implementing an RNN in Keras 

Adding a recurrent layer to a neural network architecture to create an RNN is
straightforward in Keras, as we illustrate in our RNN Sentiment Classifier Jupyter notebook.
For the sake of brevity and readability, please note that the following code cells are
identical across all the Jupyter notebooks in this chapter, including the Dense and
Convolutional Sentiment Classifier notebooks that we’ve already covered:

■ Loading dependencies (Example 11.12), except that there are often one or two
additional dependencies in a given notebook. We’ll note these additions
separately, typically when we present the notebook’s neural network
architecture.

■ Loading IMDb film review data (Example 11.14).

■ Standardizing review length (Example 11.18).

■ Compiling the model (Example 11.20).

■ Creating the ModelCheckpoint() object and directory (Example 11.21).

■ Fitting the model (Example 11.22).

■ Loading the model parameters from the best epoch (Example 11.23), with the critical
exception that the particular epoch we select to load varies depending on which
epoch has the lowest validation loss.

■ Predicting ŷ for all validation data (Example 11.24).

■ Plotting a histogram of ŷ (Example 11.26).

■ Calculating ROC AUC (Example 11.27).

The code cells that vary are those in which we: 

1. Set hyperparameters 

2. Design the neural network architecture 

The hyperparameters for our RNN are as shown in Example 11.34. 

Example 11.34 RNN sentiment classifier hyperparameters 

# output directory name:
output_dir = 'model_output/rnn'

# training:
epochs = 16 # way more!
batch_size = 128

# vector-space embedding:
n_dim = 64
n_unique_words = 10000
max_review_length = 100 # lowered due to vanishing gradient over time
pad_type = trunc_type = 'pre'
drop_embed = 0.2

# RNN layer architecture:
n_rnn = 256
drop_rnn = 0.2

Changes relative to our previous sentiment classifier notebooks are: 

■ We quadrupled epochs of training to 16 because overfitting didn’t occur in the early
epochs.

■ We lowered max_review_length back down to 100, although even this is excessive
for a simple RNN. We can backpropagate over about 100 timesteps (i.e., 100
tokens or words in a natural language model) with an LSTM (covered in the
next section) before the gradient of learning vanishes completely, but the gradient
in a plain old RNN vanishes completely after about 10 timesteps. Thus,
max_review_length could probably be lowered to less than 10 before we would
notice a reduction in this model’s performance.

■ For all of the RNN-family architectures in this chapter, we experimented with
doubling the word-vector vocabulary to 10000 tokens. This seemed to provide
improved results for these architectures, although we didn’t test it rigorously.

■ We set n_rnn = 256, so we could say that this recurrent layer has 256 units, or,
alternatively, we could say it has 256 cells. In the same way that having 256
convolutional filters enabled our CNN model to specialize in detecting 256 unique triplets
of word meaning, 43 this setting enables our RNN to detect 256 unique sequences
of word meaning that may be relevant to review sentiment.
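
With pad_type and trunc_type both set to 'pre', a review longer than max_review_length keeps only its final tokens, and a shorter review has padding added at its start. The sketch below illustrates that behavior in plain Python; pad_or_truncate_pre is a hypothetical helper written for this illustration (it is not the Keras utility used in the notebooks), and 0 is assumed as the padding index.

```python
def pad_or_truncate_pre(tokens, max_length, pad_value=0):
    """Force a list of token indices to exactly max_length items."""
    if len(tokens) >= max_length:
        # 'pre' truncation: drop tokens from the START, keeping the end of the review.
        return tokens[len(tokens) - max_length:]
    # 'pre' padding: place the padding values BEFORE the first real token.
    return [pad_value] * (max_length - len(tokens)) + tokens


short_review = [11, 12, 13]
long_review = [21, 22, 23, 24, 25, 26]

print(pad_or_truncate_pre(short_review, 5))  # [0, 0, 11, 12, 13]
print(pad_or_truncate_pre(long_review, 4))   # [23, 24, 25, 26]
```

Keeping the real tokens at the end of the sequence places them at the later timesteps, which, given the vanishing gradient, are the ones a recurrent layer learns from most readily.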

Our RNN model architecture is provided in Example 11.35.

Example 11.35 RNN sentiment classifier architecture

from keras.layers import SimpleRNN

model = Sequential()
model.add(Embedding(n_unique_words, n_dim,
                    input_length=max_review_length))
model.add(SpatialDropout1D(drop_embed))
model.add(SimpleRNN(n_rnn, dropout=drop_rnn))
model.add(Dense(1, activation='sigmoid'))

In place of a convolutional layer or a dense layer (or both) within the hidden layers of this
model, we have a Keras SimpleRNN() layer, which has a dropout argument; as a result,
we didn’t need to add dropout in a separate line of code. Unlike putting a dense layer
after a convolutional layer, it is relatively uncommon to add a dense layer after a recurrent
layer, because it provides little performance advantage. You’re welcome to try it by adding
in a Dense() hidden layer anyway.


43. “Word meaning” here refers to a location in word-vector space.

The results of running this model (which are shown in full in our RNN Sentiment
Classifier notebook) were not encouraging. We found that the training loss, after going
down steadily over the first half-dozen epochs, began to jump around after that. This
indicates that the model is struggling to learn patterns even within the training data,
which (relative to the validation data) it should be readily able to do. Indeed, all of the
models fit so far in this book have had training losses that reliably attenuated epoch over
epoch.

As the training loss bounced around, so too did the validation loss. We observed the
lowest validation loss in the seventh epoch (0.504), which corresponded to a validation
accuracy of 77.6 percent and an ROC AUC of 84.9 percent. All three of these metrics
are our worst yet for a sentiment classifier model. This is because, as we mentioned
earlier in this section, simple RNNs are only able to backpropagate through ~10 timesteps
before the gradient diminishes so much that parameter updates become negligibly small.
Because of this, simple RNNs are rarely used in practice: More-sophisticated recurrent
layer types like LSTMs, which can backpropagate through ~100 timesteps, are far more
common. 44

Long Short-Term Memory Units 

As stated at the end of the preceding section, simple RNNs are adequate if the space
between the relevant information and the context where it’s needed is small (fewer than
10 timesteps); however, if the task requires a broader context (which is often the case in
NLP tasks), there is another recurrent layer type that is well suited to it: long short-term
memory units, or LSTMs.

LSTMs were introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997, 45 but
they are more widely used in NLP deep learning applications today than ever before.

The basic structure of an LSTM layer is the same as the simple recurrent layers captured
in Figure 11.25. LSTMs receive input from the sequence of data (e.g., a particular token
from a natural language document), and they also receive input from the previous time
point in the sequence. The difference is that inside each cell in a simple recurrent layer
(e.g., SimpleRNN() in Keras), you’ll find a single neural network activation function such
as a tanh function, which transforms the RNN cell’s inputs to generate its output. In
contrast, the cells of an LSTM layer contain a far more complex structure, as depicted in
Figure 11.26.

This schematic can appear daunting, and, admittedly, we agree that a full step-by-step 
breakdown of each component inside of an LSTM cell is unnecessarily detailed for this 
book. 46 That said, there are a few key points that we should nevertheless touch on here. 
The first is the cell state running across the top of the LSTM cell. Notice that the cell 


44. The only situation we could think of where a simple RNN would be practical is one where your sequences
only had 10 or fewer consecutive timesteps of information that are relevant to the problem you’re solving with
your model. This might be the case with some time series forecasting models or if you only had very short strings
of natural language in your dataset.

45. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735-80.

46. For a thorough exposition of LSTM cells, we recommend Christopher Olah’s highly visual explainer, which
is available at bit.ly/colahLSTM.

Figure 11.26 Schematic diagram of an LSTM


state does not pass through any nonlinear activation functions. In fact, the cell state only
undergoes some minor linear transformations, but otherwise it simply passes through
from cell to cell. Those two linear transformations (a multiplication and an addition
operation) are points where a cell in an LSTM layer can add information to the cell state,
information that will be passed on to the next cell in the layer. In either case, there is a
sigmoid activation (represented by σ in the figure) before the information is added to the
cell state. Because a sigmoid activation produces values between 0 and 1, these sigmoids
act as “gates” that decide whether new information (from the current timestep) is added
to the cell state or not.

The new information at the current timestep is a simple concatenation of the current
timestep’s input and the hidden state from the preceding timestep. This concatenation has
two chances to be incorporated into the cell state (either linearly or following a nonlinear
tanh activation) and in either case it’s those sigmoid gates that decide whether the
information is combined.

After the LSTM has determined what information to add to the cell state, another 
sigmoid gate decides whether the information from the current input is added to the 
final cell state, and this results in the output for the current timestep. Notice that, under 
a different name (“hidden state”), the output is also sent into the next LSTM module 
(which represents the next timestep in the sequence), where it is combined with the next 
timestep’s input to begin the whole process again, and that (alongside the hidden state) 
the final cell state is also sent to the module representing the next timestep.

We know this might be a lot to come to grips with. Another way to distill this LSTM
content is:

■ The cell state enables information to persist along the length of the sequence,
through each timestep in a given LSTM cell. It is the long-term memory of the
LSTM.

■ The hidden state is analogous to the recurrent connections in a simple RNN and
represents the short-term memory of the LSTM.

■ Each module represents a particular point in the sequence of data (e.g., a particular 
token from a natural language document). 

■ At each timestep, several decisions are made (using those sigmoid gates) about 
whether the information at that particular timestep in the sequence is relevant to 
the local (hidden state) and global (cell state) contexts. 

■ The first two sigmoid gates determine whether the information from the current 
timestep is relevant to the global context (the cell state) and how it will be com- 
bined into that stream. 

■ The final sigmoid gate determines whether the information from the current 
timestep is relevant to the local context (i.e., whether it is added to the hidden state, 
which doubles as the output for the current timestep). 

We recommend taking a moment to reconsider Figure 11.26 and see if you can follow
how information moves through an LSTM cell. This task should be easier if you keep in
mind that the sigmoid gates decide whether information is let through or not. Regardless,
the primary take-aways from this section are:

■ Simple RNN cells pass only one type of information (the hidden state) between
timesteps and contain only one activation function.

■ LSTM cells are markedly more complex: They pass two types of information
between timesteps (hidden state and cell state) and contain five activation functions.
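
For readers who find code easier to follow than a schematic, the take-aways above can be sketched as a single LSTM timestep in plain Python. This uses the commonly presented LSTM formulation (forget, input, and output gates plus a tanh candidate) with one unit and scalar values; the gate names, the lstm_step function, and the weight values are assumptions made for illustration and are not taken from Figure 11.26 or from Keras internals.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One timestep of a single-unit LSTM with scalar input and states.

    params maps each gate name to a (w_x, w_h, b) tuple of hypothetical weights.
    Returns the new hidden state and the new cell state.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation('forget'))        # sigmoid gate: keep old cell state?
    i = sigmoid(pre_activation('input'))         # sigmoid gate: admit new information?
    g = math.tanh(pre_activation('candidate'))   # tanh: the new information itself
    o = sigmoid(pre_activation('output'))        # sigmoid gate: expose the cell state?

    c = f * c_prev + i * g      # cell state: only a multiplication and an addition
    h = o * math.tanh(c)        # hidden state, which doubles as this step's output
    return h, c


# Hypothetical weights: the forget gate is wide open and the input gate nearly shut,
# so the cell state should pass through almost unchanged.
params = {
    'forget': (0.0, 0.0, 10.0),
    'input': (0.0, 0.0, -10.0),
    'candidate': (1.0, 1.0, 0.0),
    'output': (0.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=params)
print(h, c)
```

Counting the calls confirms the tally in the take-aways: three sigmoids and two tanh functions, five activation functions in all, with two states (h and c) handed on to the next timestep.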

Implementing an LSTM with Keras 

Despite all of their additional computational complexity, as demonstrated within our
LSTM Sentiment Classifier notebook, implementing LSTMs with Keras is a breeze. As
shown in Example 11.36, we selected the same hyperparameters for our LSTM as we did
for our simple RNN, except:

■ We changed the output directory name.

■ We updated variable names to n_lstm and drop_lstm.

■ We reduced the number of epochs of training to 4 because the LSTM begins to 
overfit to the training data much earlier than the simple RNN. 

Example 11.36 LSTM sentiment classifier hyperparameters 

# output directory name:
output_dir = 'model_output/LSTM'

# training:
epochs = 4
batch_size = 128

# vector-space embedding:
n_dim = 64
n_unique_words = 10000
max_review_length = 100
pad_type = trunc_type = 'pre'
drop_embed = 0.2

# LSTM layer architecture:
n_lstm = 256
drop_lstm = 0.2

Our LSTM model architecture is also the same as our RNN architecture, except that we
replaced the SimpleRNN() layer with LSTM(); see Example 11.37.

Example 11.37 LSTM sentiment classifier architecture 

from keras.layers import LSTM

model = Sequential()
model.add(Embedding(n_unique_words, n_dim,
                    input_length=max_review_length))
model.add(SpatialDropout1D(drop_embed))
model.add(LSTM(n_lstm, dropout=drop_lstm))
model.add(Dense(1, activation='sigmoid'))

The results of training the LSTM are provided in full in our LSTM Sentiment Classifier
notebook. To summarize, training loss decreased steadily epoch over epoch, suggesting
that model-fitting proceeded more conventionally than with our simple RNN. The
results are not a slam dunk, however. Despite its relative sophistication, our LSTM
performed only as well as our baseline dense model. The LSTM’s epoch with the lowest
validation loss is the second one (0.349); it had a validation accuracy of 84.8 percent and
an ROC AUC of 92.8 percent.
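
One concrete measure of the LSTM's "relative sophistication" is its parameter count. The sketch below assumes the standard formulations, in which a simple recurrent layer has one set of input weights, recurrent weights, and biases, and an LSTM layer has four such sets (three gates plus the candidate); these counts are derived from those formulas rather than reported in the text, and the function names are hypothetical.

```python
def simple_rnn_params(n_inputs, n_units):
    """Input weights + recurrent weights + biases for one set of units."""
    return n_units * (n_inputs + n_units + 1)


def lstm_params(n_inputs, n_units):
    """Four weight sets: forget gate, input gate, output gate, and candidate."""
    return 4 * simple_rnn_params(n_inputs, n_units)


n_dim = 64     # each timestep's input is a 64-dimensional word vector
n_rnn = 256    # units in the simple recurrent layer (Example 11.34)
n_lstm = 256   # units in the LSTM layer (Example 11.36)

print(simple_rnn_params(n_dim, n_rnn))  # 82176
print(lstm_params(n_dim, n_lstm))       # 328704
```

Under these assumptions the LSTM layer carries four times as many parameters as the simple recurrent layer of the same width.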

Bidirectional LSTMs 

Bidirectional LSTMs (or Bi-LSTMs, for short) are a clever variation on standard LSTMs.
Whereas the latter involve backpropagation in only one direction (typically backward over
timesteps, such as from the end of a movie review toward the beginning), bidirectional
LSTMs involve backpropagation in both directions (backward and forward over timesteps)
across some one-dimensional input. This extra backpropagation doubles computational
complexity, but if accuracy is paramount to your application, it is often worth it:
Bi-LSTMs are a popular choice in modern NLP applications because their ability to
learn patterns both before and after a given token within an input document facilitates
high-performing models.

Converting our LSTM architecture (Example 11.37) into a Bi-LSTM architecture is
painless. We need only wrap our LSTM() layer within the Bidirectional() wrapper, as
shown in Example 11.38.

Example 11.38 Bidirectional LSTM sentiment classifier architecture 

from keras.layers import LSTM
from keras.layers.wrappers import Bidirectional # new!

model = Sequential()
model.add(Embedding(n_unique_words, n_dim,
                    input_length=max_review_length))
model.add(SpatialDropout1D(drop_embed))
model.add(Bidirectional(LSTM(n_lstm, dropout=drop_lstm)))
model.add(Dense(1, activation='sigmoid'))

The straightforward conversion from LSTM to Bi-LSTM yielded substantial
performance gains, as the results of model-fitting show (provided in full in our Bi-LSTM
Sentiment Classifier notebook). The epoch with the lowest validation loss (0.331) was the
fourth, which had validation accuracy of 86.0 percent and an ROC AUC of 93.5 percent,
making it our second-best model so far as it trails behind only our convolutional
architecture.
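
Conceptually, a bidirectional wrapper runs two copies of a recurrent layer, one reading the sequence as given and one reading it reversed, and then combines their outputs. The sketch below illustrates the idea in plain Python; for brevity it substitutes a single tanh recurrent unit for an LSTM, it combines the two results by concatenation (an assumption about how the outputs are merged), and all names and weights are hypothetical.

```python
import math


def final_state(inputs, w_x, w_h):
    """Final hidden state of a one-unit tanh recurrent layer."""
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h


def bidirectional_final_states(inputs, forward_weights, backward_weights):
    """Read the sequence in both directions and concatenate the final states."""
    forward = final_state(inputs, *forward_weights)
    backward = final_state(list(reversed(inputs)), *backward_weights)
    return [forward, backward]


# The only informative token sits at the very start of the sequence.
sequence = [1.0, 0.0, 0.0, 0.0]
out = bidirectional_final_states(sequence, (1.0, 0.5), (1.0, 0.5))
print(out)
```

The forward reader has largely forgotten the first token by the time it reaches the end, whereas the backward reader encounters that token last and retains it strongly. Having both views available is the intuition behind the gains reported above.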

Stacked Recurrent Models 

Stacking multiple RNN-family layers (be they SimpleRNN(), LSTM(), or another type) is
not quite as straightforward as stacking dense or convolutional layers in Keras, although
it certainly isn’t difficult: It requires only specifying an extra argument when the layer is
defined.

As we’ve discussed, recurrent layers take in an ordered sequence of inputs. The recurrent 
nature of these layers comes from their processing each timestep in the sequence and pass- 
ing along a hidden state as an input to the next timestep in the sequence. Upon reaching 
the final timestep in the sequence, the output of a recurrent layer is the final hidden state. 

So in order to stack recurrent layers, we use the argument return_sequences=True.
This asks the recurrent layer to return the hidden states for each step in the layer’s
sequence. The resulting output now has three dimensions, matching the dimensions of the
input sequence that was fed into it. The default behavior of a recurrent layer is to pass
only the final hidden state to the next layer. This works perfectly well if we’re passing
this information to, say, a dense layer. If, however, we’d like the subsequent layer in our
network to be another recurrent layer, that subsequent recurrent layer must receive a
sequence as its input. Thus, to pass the array of hidden states from across all individual
timesteps in the sequence (as opposed to only the single final hidden state value) to this
subsequent recurrent layer, we set the optional return_sequences argument to True. 47
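
The difference between the two settings can be sketched with a toy recurrent layer in plain Python. This is an illustration with a single tanh unit and hypothetical weights, not the Keras implementation; the function recurrent_layer simply mimics the behavior of the return_sequences argument described above.

```python
import math


def recurrent_layer(inputs, w_x, w_h, return_sequences=False):
    """One-unit tanh recurrent layer over a sequence of scalar inputs."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    # Either the hidden state from every timestep, or only the final one.
    return states if return_sequences else h


sequence = [0.2, -0.1, 0.4]

# First layer returns a full sequence, so a second recurrent layer can consume it.
first = recurrent_layer(sequence, 1.0, 0.5, return_sequences=True)

# Final recurrent layer keeps the default and returns a single value,
# which is what a downstream dense layer expects.
second = recurrent_layer(first, 1.0, 0.5)

print(first)
print(second)
```

If the first layer had kept the default, it would have handed a single number to the second layer, which would then have had no sequence to step through.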


47. There is also a return_state argument (which, like return_sequences, defaults to False) that asks the
network to return the final cell state in addition to the final hidden state. This optional argument is not used as
often, but it is useful when we’d like to initialize a recurrent layer’s cell state with that of another layer, as we do in
“encoder-decoder” models (introduced in the next section).

To observe this in action, check out the two-layer Bi-LSTM model shown in
Example 11.39. (Notice that in this example we still leave the final recurrent layer with its
default return_sequences=False so that only the final hidden state of this final recurrent
layer is returned for use further downstream in the network.)

Example 11.39 Stacked recurrent model architecture 

from keras.layers import LSTM
from keras.layers.wrappers import Bidirectional

model = Sequential()
model.add(Embedding(n_unique_words, n_dim,
                    input_length=max_review_length))
model.add(SpatialDropout1D(drop_embed))
model.add(Bidirectional(LSTM(n_lstm_1, dropout=drop_lstm,
                             return_sequences=True))) # new!
model.add(Bidirectional(LSTM(n_lstm_2, dropout=drop_lstm)))
model.add(Dense(1, activation='sigmoid'))

As you’ve discovered a number of times since Chapter 1 of this book, additional layers
within a neural network model can enable it to learn increasingly complex and abstract
representations. In this case, the abstraction facilitated by the supplementary Bi-LSTM
layer translated to performance gains. The stacked Bi-LSTM outperformed its unstacked
cousin by a noteworthy margin, with an ROC AUC of 94.9 percent and validation
accuracy of 87.8 percent in its best epoch (the second, with its validation loss of 0.296). The
full results are provided in our Stacked Bi-LSTM Sentiment Classifier notebook.

The performance of our stacked Bi-LSTM architecture, despite being considerably 
more sophisticated than our convolutional architecture and despite being designed specifi- 
cally to handle sequential data like natural language, nevertheless lags behind the accuracy 
of our ConvNet model. Perhaps some hyperparameter experimentation and fine-tuning 
would yield better results, but ultimately our hypothesis is that because the IMDb film 
review dataset is so small, our LSTM models don’t have an opportunity to demonstrate 
their potential. We opine that a much larger natural language dataset would facilitate 
effective backpropagation over the many timesteps associated with LSTM layers. 48 


A relative of the LSTM within the family of RNNs is the gated recurrent unit 
(GRU). 49 GRUs are slightly less computationally intensive than LSTMs because they 
involve only three activation functions, and yet their performance often approaches 
the performance of LSTMs. If a bit more compute isn’t a deal breaker for you, we see 


48. If you’d like to test our hypothesis yourself, we provide appropriate sentiment analysis dataset suggestions in 
Chapter 14. 

49. Cho, K., et al. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine
translation. arXiv:1406.1078.

little advantage in choosing a GRU over an LSTM. If you’re interested in trying a
GRU in Keras anyway, it’s as easy as importing the GRU() layer type and dropping it
into a model architecture where you might otherwise place an LSTM() layer. Check
out our GRU Sentiment Classifier notebook for a hands-on example.


Seq2seq and Attention 

Natural language techniques that involve so-called sequence-to-sequence (seq2seq;
pronounced “seek-to-seek”) models take in an input sequence and generate an output
sequence as their product. Neural machine translation (NMT) is a quintessential class of
seq2seq models, with Google Translate’s machine-translation algorithm serving as an
example of NMT being used in a production system. 50

NMTs consist of an encoder-decoder structure, wherein the encoder processes the input 
sequence and the decoder generates the output sequence. The encoder and decoder are 
both RNNs, and so during the encoding step there exists a hidden state that is passed 
between units of the RNN. At the end of the encoding phase, the final hidden state is 
passed to the decoder; this final state can be referred to as the “context.” In this way, 
the decoder starts with a context for what is happening in the input sequence. Although 
this idea is sound in theory, the context is often a bottleneck: It’s difficult for models to 
handle really long sequences, and so the context loses its punch. 

Attention was developed to overcome the computational bottleneck associated with
context. 51 In a nutshell, instead of passing a single hidden state vector (the final one) from
the encoder to the decoder, with attention we pass the full sequence of hidden states to
the decoder. Each of these hidden states is associated with a single step in the input
sequence, although the decoder might need the context from multiple steps in the input
to inform its behavior at any given step during decoding. To achieve this, for each step
in the sequence the decoder calculates a score for each of the hidden states from the
encoder. Each encoder hidden state is multiplied by the softmax of its score. 52 This serves to
amplify the most relevant contexts (they would have high scores, and thus higher softmax
probabilities) while muting the ones that aren’t relevant; in essence, attention weights
the available contexts for a given timestep. The weighted hidden states are summed, and
this new context vector is used to predict the output for each timestep in the decoder
sequence. Following this approach, the model selectively reviews what it knows about
the input sequence and uses only the relevant information where necessary to inform the
output. It’s paying attention to the most relevant elements of the whole sentence!
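
The weighting described in this paragraph can be sketched in a few lines of NumPy. This is a minimal illustration, not the mechanism of any particular NMT system: the hidden-state values are made up, and we assume a simple dot-product score between the decoder state and each encoder hidden state (the paper cited in footnote 51 learns its scoring function with a small neural network instead).

```python
import numpy as np

def softmax(scores):
    # subtract the max for numerical stability; the result sums to 1
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

# hypothetical encoder hidden states: 4 input steps, 3 units each
encoder_states = np.array([[0.1, 0.3, -0.2],
                           [0.9, -0.4, 0.5],
                           [-0.3, 0.8, 0.1],
                           [0.2, 0.2, 0.7]])

# hypothetical decoder hidden state at the current output step
decoder_state = np.array([1.0, -0.5, 0.5])

# one score per encoder hidden state (dot-product scoring is an assumption)
scores = encoder_states @ decoder_state

# softmax turns the scores into weights that sum to 1
weights = softmax(scores)

# weight each hidden state, then sum to form the context vector
context = (weights[:, np.newaxis] * encoder_states).sum(axis=0)

print(weights.round(3))
print(context.round(3))
```

The second input step receives the largest weight here because its hidden state has the highest score, so it contributes most to the context vector for this decoding step.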

If this book were dedicated solely to NLP, we’d have at least a chapter covering 
seq2seq and attention. As it stands, we’ll have to leave it to you to further explore these 
techniques, which are raising the bar of the performance of many NLP applications.


50. Google Translate has incorporated NMT since 2016. You can read more about it at bit.ly/translateNMT.

51. Bahdanau, D., et al. (2014). Neural machine translation by jointly learning to align and translate. 
arXiv: 1409.0473. 

52. Recall from Chapter 6 that the softmax function takes a vector of real numbers and generates a probability
distribution with the same number of classes as the input vector.

Transfer Learning in NLP

Machine vision practitioners have for a number of years been helped along by the ready 
availability of nuanced models that have been pretrained on large, rich datasets. As cov- 
ered in the “Transfer Learning” section near the end of Chapter 10, casual users can 
download model architectures with pretrained weights and rapidly scale up their particular 
vision application to a state-of-the-art model. Well, more recently, such transfer learning 
has become readily available for NLP, too. 53 

First came ULMFiT (universal language model fine-tuning), wherein tools were
described and open-sourced that enabled others to use a lot of what the model learns
during pretraining. 54 In this way, models can be fine-tuned on task-specific data, thus
requiring less training time and fewer data to attain high-accuracy results.

Shortly thereafter, ELMo (embeddings from language models) was revealed to the
world. 55 In this update to the standard word vectors we introduced in this chapter, the
word embeddings are dependent not only on the word itself but also on the context in
which the word occurs. In place of a fixed word embedding for each word in the
dictionary, ELMo looks at each word in the sentence before assigning each word a specific
embedding. The ELMo model is pretrained on a very large corpus; if you had to train it
yourself, it would likely strain your compute resources, but you can now nevertheless use
it as a component in your own NLP models.

The final transfer learning development we’ll mention is the release of BERT
(bi-directional encoder representations from transformers) from Google. 56 Perhaps even
more so than ULMFiT and ELMo, pretrained BERT models tuned to particular NLP
tasks have been associated with the achievement of new state-of-the-art benchmarks
across a broad range of applications, while requiring much less training time and fewer
data to get there.

Non-sequential Architectures: The Keras Functional API


To solve a given problem, there are countless ways that the layer types we’ve already
covered in this book can be recombined to form deep learning model architectures. For
example, see our Conv LSTM Stack Sentiment Classifier notebook, wherein we were
extra creative in designing a model that involves a convolutional layer passing its


53. When we introduced Keras Embedding() layers earlier in this chapter, we touched on transfer learning with
word vectors. The transfer learning approaches covered in this section—ULMFiT, ELMo, and BERT—are closer 
in spirit to the transfer learning of machine vision, because (analogous to the hierarchical visual features that 
are represented by a deep CNN; see Figure 1.17) they allow for the hierarchical representation of the elements 
of natural language (e.g., subwords, words, and context, as in Figure 2.9). Word vectors, in contrast, have no 
hierarchy; they capture only the word level of language. 

54. Howard, J., and Ruder, S. (2018). Universal language model fine-tuning for text classification. 
arXiv:1801.06146. 

55. Peters, M.E., et al. (2018). Deep contextualized word representations. arXiv: 1802.05365. 

56. Devlin, J., et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. 
arXiv:1810.04805.

activations into a Bi-LSTM layer. 57 Thus far, however, our creativity has been
constrained by our use of the Keras Sequential() model, which requires each layer to flow
directly into a following one.

Although sequential models constitute the vast majority of deep learning models, there
are times when non-sequential architectures (which permit infinite model-design
possibilities and are often more complex) could be warranted. 58 In such situations, we can
take advantage of the Keras functional API, which makes use of the Model class instead of
the Sequential models we’ve worked with so far in this book.

As an example of a non-sequential architecture, we decided to riff on our highest-
performing sentiment classifier, the convolutional model, to see if we could squeeze
more juice out of the proverbial lemon. As diagrammed in Figure 11.27, our idea was
to have three parallel streams of convolutional layers, each of which takes in word
vectors from an Embedding() layer. As in our Convolutional Sentiment Classifier notebook,



Figure 11.27 A non-sequential model architecture: Three parallel streams of
convolutional layers, each with a unique filter length (k = 2, k = 3, or k = 4), receive
input from a word-embedding layer. The activations of all three streams are concatenated
together and passed into a pair of sequentially stacked dense hidden layers en route to
the sigmoid output neuron.


57. This conv-LSTM model approached the validation accuracy and ROC AUC of our Stacked Bi-LSTM
architecture, but each epoch trained in 82 percent less time.

58. Popular aspects of non-sequential models include having multiple model inputs or outputs (potentially at 
different levels within the architecture; e.g., a model could have an additional input or an additional output midway 
through the architecture), sharing the activations of a single layer with multiple other layers, and creating directed 
acyclic graphs.

one of these streams would have a filter length of three tokens. One of the others will
have a filter length of two, so it will specialize in learning word-vector pairs that appear
to be relevant to classifying a film review as having positive or negative sentiment. The
third convolutional stream will have a filter length of four tokens, so it will specialize in
detecting relevant quadruplets of word meaning.

The hyperparameters for our three-convolutional-stream model are provided in Exam- 
ple 11.40 as well as in our Multi ConvNet Sentiment Classifier Jupyter notebook. 

Example 11.40 Multi-ConvNet sentiment classifier hyperparameters 

# output directory name:
output_dir = 'model_output/multiconv'

# training:
epochs = 4
batch_size = 128

# vector-space embedding:
n_dim = 64
n_unique_words = 5000
max_review_length = 400
pad_type = trunc_type = 'pre'
drop_embed = 0.2

# convolutional layer architecture:
n_conv_1 = n_conv_2 = n_conv_3 = 256
k_conv_1 = 3
k_conv_2 = 2
k_conv_3 = 4

# dense layer architecture:
n_dense = 256
dropout = 0.2

The novel hyperparameters are associated with the three convolutional layers. All three
convolutional layers have 256 filters, but mirroring the diagram in Figure 11.27, the layers
form parallel streams, each with a unique filter length (k) that ranges from 2 up to 4.

The Keras code for our multi-ConvNet model architecture is provided in 
Example 11.41. 

Example 11.41 Multi-ConvNet sentiment classifier architecture 

from keras.models import Model
from keras.layers import Input, concatenate
# remaining layer types used below (added so that this listing stands alone):
from keras.layers import Embedding, SpatialDropout1D, Conv1D
from keras.layers import GlobalMaxPooling1D, Dense, Dropout

# input layer:
input_layer = Input(shape=(max_review_length,),
                    dtype='int16', name='input')

# embedding:
embedding_layer = Embedding(n_unique_words, n_dim,
                            name='embedding')(input_layer)
drop_embed_layer = SpatialDropout1D(drop_embed,
                                    name='drop_embed')(embedding_layer)

# three parallel convolutional streams:
conv_1 = Conv1D(n_conv_1, k_conv_1,
                activation='relu', name='conv_1')(drop_embed_layer)
maxp_1 = GlobalMaxPooling1D(name='maxp_1')(conv_1)

conv_2 = Conv1D(n_conv_2, k_conv_2,
                activation='relu', name='conv_2')(drop_embed_layer)
maxp_2 = GlobalMaxPooling1D(name='maxp_2')(conv_2)

conv_3 = Conv1D(n_conv_3, k_conv_3,
                activation='relu', name='conv_3')(drop_embed_layer)
maxp_3 = GlobalMaxPooling1D(name='maxp_3')(conv_3)

# concatenate the activations from the three streams:
concat = concatenate([maxp_1, maxp_2, maxp_3])

# dense hidden layers:
dense_layer = Dense(n_dense,
                    activation='relu', name='dense')(concat)
drop_dense_layer = Dropout(dropout, name='drop_dense')(dense_layer)
dense_2 = Dense(int(n_dense/4),
                activation='relu', name='dense_2')(drop_dense_layer)
dropout_2 = Dropout(dropout, name='drop_dense_2')(dense_2)

# sigmoid output layer:
predictions = Dense(1, activation='sigmoid', name='output')(dropout_2)

# create model:
model = Model(input_layer, predictions)

This architecture may look a little alarming if you haven’t seen the Keras Model class used
before, but as we break it down line-by-line here, it should lose any intimidating aspects it
might have:

■ With the Model class, we specify the Input() layer independently, as opposed to
specifying it as the shape argument of the first hidden layer. We specified the data
type (dtype) explicitly: 16-bit integers (int16) can range up to 32,767, which
will accommodate the maximum index of the words we input. 59 As with all of
the layers in this model, we specify a recognizable name argument so that when
we print the model later (using model.summary()) it will be easy to make sense of
everything.

■ Every layer is assigned to a unique variable name, such as input_layer,
embedding_layer, and conv_2. We will use these variable names to specify the
flow of data within our model.

■ The most noteworthy aspect of using the Model class, which will be familiar to
developers who have worked with functional programming languages, is the
variable name within the second set of parentheses following any layer call.
This specifies which layer’s outputs are flowing into a given layer. For example,
(input_layer) in the second set of parentheses of the embedding_layer indicates
that the output of the input layer flows into the embedding layer.

■ The Embedding() and SpatialDropout1D layers take the same arguments as before
in this chapter.

■ The output of the SpatialDropout1D layer (with a variable named
drop_embed_layer) is the input to three separate, parallel convolutional layers:
conv_1, conv_2, and conv_3.

■ As per Figure 11.27, each of the three convolutional streams includes a Conv1D layer
(with a unique k_conv filter length) and a GlobalMaxPooling1D layer.

■ The activations output by the GlobalMaxPooling1D layer of each of the three
convolutional streams are concatenated into a single array of activation values by the
concatenate() layer, which takes in a list of inputs ([maxp_1, maxp_2, maxp_3])
as its only argument.

■ The concatenated convolutional-stream activations are provided as input to two
Dense() hidden layers, each of which has a Dropout() layer associated with it.
(The second dense layer has one-quarter as many neurons as the first, as specified by
n_dense/4.)

■ The activations output by the sigmoid output neuron (ŷ) are assigned to the
variable name predictions.

■ Finally, the Model class ties all of the model’s layers together by taking two
arguments: the variable name of the input layer (i.e., input_layer) and the output layer
(i.e., predictions).
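
To see why the three streams can be concatenated, it may help to trace the shapes for a single review. The sketch below uses random numbers in place of real activations and assumes the Keras Conv1D default of no padding ('valid'), so that a filter of length k yields max_review_length - k + 1 positions; the helper name fake_stream is ours, purely for illustration.

```python
import numpy as np

max_review_length = 400
n_conv = 256                  # filters per stream
filter_lengths = [3, 2, 4]    # k_conv_1, k_conv_2, k_conv_3

rng = np.random.default_rng(42)

def fake_stream(k):
    # stand-in for a Conv1D output of shape (positions, filters);
    # 'valid' padding is assumed, so positions = length - k + 1
    n_positions = max_review_length - k + 1
    conv_out = rng.random((n_positions, n_conv))
    # global max-pooling keeps the largest activation of each filter
    return conv_out.max(axis=0)

pooled = [fake_stream(k) for k in filter_lengths]
concat = np.concatenate(pooled)

print([p.shape for p in pooled])
print(concat.shape)

# the int16 dtype chosen for the input layer tops out at 32,767
print(np.iinfo(np.int16).max)
```

Whatever the filter length, global max-pooling reduces each stream to 256 values, so the concatenated vector passed to the first dense layer has 3 × 256 = 768 values.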

Our elaborate parallel network architecture ultimately provided us with a modest 
bump in capability to give us the best-performing sentiment classifier in this chapter (see 
Table 11.6). As detailed in our Multi ConvNet Sentiment Classifier notebook, the lowest 
validation loss was attained in the second epoch (0.262), and this epoch was associated 
with a validation accuracy of 89.4 percent and an ROC AUC of 96.2 percent—a tenth of 
a percent better than our Sequential convolutional model.


59. The index goes up to only 5,500, because of the n_unique_words and n_words_to_skip hyperparameters
we selected.

Table 11.6 Comparison of the performance of our sentiment classifier model
architectures

Model              ROC AUC (%)
Dense              92.9
Convolutional      96.1
Simple RNN         84.9
LSTM               92.8
Bi-LSTM            93.5
Stacked Bi-LSTM    94.9
GRU                93.0
Conv-LSTM          94.5
Multi-ConvNet      96.2


Summary 

In this chapter, we discussed methods for preprocessing natural language data, ways to 
create word vectors from a corpus of natural language, and the procedure for calculat- 
ing the area under the receiver operating characteristic curve. In the second half of the 
chapter, we applied this knowledge to experiment with a wide range of deep learning 
NLP models for classifying film reviews as favorable or negative. Some of these models 
involved layer types you were familiar with from earlier chapters (i.e., dense and convo- 
lutional layers), while later ones involved new layer types from the RNN family (LSTMs 
and GRUs) and, for the first time in this book, a non-sequential model architecture. 

A summary of the results of our sentiment-classifier experiments is provided in
Table 11.6. We hypothesize that, had our natural language dataset been much larger, the
Bi-LSTM architectures might have outperformed the convolutional ones.

Key Concepts

Here are the essential foundational concepts thus far. New terms from the current
chapter are highlighted in purple.

■ parameters:
    ■ weight w
    ■ bias b
■ activation a
■ artificial neurons:
    ■ sigmoid
    ■ tanh
    ■ ReLU
    ■ linear
■ input layer
■ hidden layer
■ output layer
■ layer types:
    ■ dense (fully connected)
    ■ softmax
    ■ convolutional
    ■ max-pooling
    ■ flatten
    ■ embedding
    ■ RNN
    ■ (bidirectional-)LSTM
    ■ concatenate
■ cost (loss) functions:
    ■ quadratic (mean squared error)
    ■ cross-entropy
■ forward propagation
■ backpropagation
■ unstable (especially vanishing) gradients
■ Glorot weight initialization
■ batch normalization
■ dropout
■ optimizers:
    ■ stochastic gradient descent
    ■ Adam
■ optimizer hyperparameters:
    ■ learning rate η
    ■ batch size
■ word2vec


Generative Adversarial Networks


Back in Chapter 3, we introduced the idea of deep learning models that can create
novel and unique pieces of visual imagery, images that we might even be able to call art.
In this chapter, we combine the high-level theory from Chapter 3 with the convolutional
networks from Chapter 10, the Keras Model class from Chapter 11, and a couple of new
layer types, enabling you to code up a generative adversarial network (GAN) that outputs
images in the style of sketches hand drawn by humans.

Essential GAN Theory 

At its highest level, a GAN involves two deep learning networks pitted against each other
in an adversarial relationship. As depicted by the trilobites in Figure 3.4, one network is
a generator that produces forgeries of images, and the other is a discriminator that attempts
to distinguish the generator’s fakes from the real thing. Moving from trilobites to slightly
more-technical schematic sketches, the generator is tasked with receiving a random
noise input and turning this into a fake image, as shown on the left in Figure 12.1. The
discriminator, a binary classifier of real versus fake images, is shown in Figure 12.1 on
the right. (The schematics in this figure are highly simplified for illustrative purposes, but
we’ll go into more detail shortly.) Over several rounds of training, the generator becomes
better at producing more-convincing forgeries, and so too the discriminator improves its
capacity for detecting the fakes. As training continues, the two models battle it out, trying
to outdo one another, and, in so doing, both models become more and more specialized
at their respective tasks. Eventually this adversarial interplay can culminate in the
generator producing fakes that are convincing not only to the discriminator network but also to
the human eye.

Training a GAN consists of two opposing (adversarial!) processes:

1. Discriminator training: As mapped out in Figure 12.2, in this process the generator
produces fake images (that is, it performs inference only 1) while the discriminator
learns to tell the fake images from real ones.


1. Inference is forward propagation alone. It does not involve model training (via, e.g., backpropagation).

Figure 12.1 Highly simplified schematic diagrams of the two models that make up a
typical GAN: the generator (left) and the discriminator (right)

2. Generator training: As depicted in Figure 12.3, in this process the discriminator
judges fake images produced by the generator. Here, it is the discriminator that
performs inference only, whereas it’s the generator that uses this information to learn: in
this case, to learn how to better fool the discriminator into classifying fake images as
real ones.

Thus, in each of these two processes, one of the models creates its output (either a fake 
image or a prediction of whether the image is fake) but is not trained, and the other
model uses that output to learn to perform its task better. 

During the overall process of training a GAN, discriminator training alternates with 
generator training. Let’s dive into both training processes in a bit more detail, starting
with discriminator training (see Figure 12.2): 

■ The generator produces fake images (by inference; shown in black) that are mixed
in with batches of real images and fed into the discriminator for training.

■ The discriminator outputs a prediction (ŷ) that corresponds to the likelihood that
the image is real.

■ The cross-entropy cost is evaluated for the discriminator’s ŷ predictions relative to
the true y labels.

■ Via backpropagation tuning the discriminator’s parameters (shown in green), the
cost is minimized in order to train the model to better distinguish real images from
fake ones.

Note well that during discriminator training, it is only the discriminator network that is 
learning; the generator network is not involved in the backpropagation, so it doesn’t learn 
anything. 
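
As a rough numerical illustration of the cost evaluated during discriminator training, the following sketch computes binary cross-entropy for a small mixed batch. The predictions are invented for the example (they do not come from a trained model), real images are labeled 1 and fakes 0, and the function name is our own.

```python
import math

def binary_cross_entropy(y_true, y_pred):
    # mean of -[y*log(p) + (1 - y)*log(1 - p)] over the batch
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# labels for a mixed batch: 1 = real image, 0 = fake from the generator
y = [1, 1, 0, 0]

# hypothetical discriminator outputs early on and later in training
y_hat_early = [0.6, 0.5, 0.5, 0.4]
y_hat_later = [0.9, 0.95, 0.1, 0.05]

cost_early = binary_cross_entropy(y, y_hat_early)
cost_later = binary_cross_entropy(y, y_hat_later)

print(round(cost_early, 3), round(cost_later, 3))
```

The cost falls as the predictions move toward the true labels, which is the direction in which backpropagation nudges the discriminator’s parameters.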

Now let’s turn our focus to the process that discriminator training alternates with: the 
training of the generator (shown in Figure 12.3):

■ The generator receives a random noise vector z as input 2 and produces a fake image 
as an output. 

2. This random noise vector z corresponds to the latent-space vector introduced in Chapter 3 (see Figure 3.4), and
it is unrelated to the z variable that has been used since Figure 6.8 to represent w • x + b. We cover this in more
detail later on in this chapter.

Figure 12.2 This is an outline of the discriminator training loop. Forward propagation
through the generator produces fake images. These are mixed into batches with real
images from the dataset and, together with their labels, are used to train the
discriminator. Learning paths are shown in green, while non-learning paths are shown in
black and the blue arrow calls attention to the image labels, y.


Figure 12.3 An outline of the generator training loop. Forward propagation through the
generator produces fake images, and inference with the discriminator scores these
images. The generator is improved through backpropagation. As in Figure 12.2, learning
paths are shown in green, and non-learning paths are shown in black. The blue arrow
calls attention to the relationship between the image and its label y which, in the case of
generator training, is always equal to 1.

■ The fake images produced by the generator are fed directly into the discriminator
as inputs. Crucially to this process, we lie to the discriminator and label all of these
fake images as real (y = 1).

■ The discriminator (by inference; shown in black) outputs ŷ predictions as to
whether a given input image is real or fake.

■ Cross-entropy cost here is used to tune the parameters of the generator network
(shown in green). More specifically, the generator learns how convincing its fake
images are to the discriminator network. By minimizing this cost, the generator
will learn to produce forgeries that the discriminator mistakenly labels as real,
forgeries that may even appear to be real to the human eye.

So, during generator training, it is only the generator network that is learning. Later in 
this chapter, we show you how to freeze the discriminator’s parameters so that
backpropagation can tune the generator’s parameters without influencing the discriminator in any
way. 
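
The effect of labeling every fake image as real can be illustrated with a little arithmetic. In this sketch, which uses made-up discriminator outputs and a function name of our own, every label is 1, so the cross-entropy cost reduces to the mean of -log(ŷ) and shrinks as the discriminator is fooled more thoroughly.

```python
import math

def generator_cost(y_hat_on_fakes):
    # with every label set to 1, cross-entropy reduces to mean(-log(y_hat))
    return sum(-math.log(p) for p in y_hat_on_fakes) / len(y_hat_on_fakes)

# hypothetical discriminator outputs for two batches of fake images
unconvincing_fakes = [0.05, 0.10, 0.02]   # discriminator leans toward "fake"
convincing_fakes = [0.80, 0.90, 0.70]     # discriminator leans toward "real"

cost_unconvincing = generator_cost(unconvincing_fakes)
cost_convincing = generator_cost(convincing_fakes)

print(round(cost_unconvincing, 3), round(cost_convincing, 3))
```

Minimizing this cost therefore pushes the generator toward images that the (frozen) discriminator scores close to 1.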

At the onset of GAN training, the generator has no idea yet what it’s supposed to
be making, so, being fed random noise as inputs, the generator produces images of
random noise as outputs. These poor-quality fakes contrast starkly with the real images,
which contain combinations of features that blend to form actual images, and therefore
the discriminator initially has no trouble at all learning to distinguish real from fake.

As the generator trains, however, it gradually learns how to replicate some of the
structure of the real images. Eventually, the generator becomes crafty enough to fool the
discriminator, and thus in turn the discriminator learns more-complex and nuanced 
features from the real images such that outwitting the discriminator becomes trickier. 

Back and forth, alternating between generator training and discriminator training in this 
way, the generator learns to forge ever-more-convincing images. At some point, the two 
adversarial models arrive at a stalemate: They reach the limits of their architectures, and 
learning stalls on both sides. 3 

At the conclusion of training, the discriminator is discarded and the generator is our
final product. We can feed in random noise, and it will output images that match the style
of the images the adversarial network was trained on. In this sense, the generative capacity
of GANs could be considered creative. If provided with a large training dataset of
photos of celebrity faces, a GAN can produce convincing photos of “celebrities” that have
never existed. As in Figure 3.4, by passing specific z values into this generator, we would
be specifying particular coordinates within the GAN’s latent space, enabling us to output
a celebrity face with whatever attributes we desire, such as a particular age, gender, or
type of eyeglasses. In the GAN that you’ll train in this chapter, you’ll use a training dataset
consisting of sketches hand drawn by humans, so our GAN will learn to produce novel
drawings, ones that no human mind has conceived of before. Hold tight for that section,
where we discuss the specific architectures of the generator and discriminator in more
detail. First, though, we describe how to download and process these sketch data.


3. More-complex generator and discriminator networks would learn more-complex features and produce more- 
realistic images. However, in some cases we don’t need that complexity, and of course these models would be harder 
to train.

Figure 12.4 Example of sketches drawn by humans who have played the Quick, Draw!
game. Baseballs, baskets, and bees, oh my!

The Quick, Draw! Dataset 

At the conclusion of Chapter 1, we encouraged you to play a round of the Quick, Draw!
game. 4 If you did, then you contributed to the world’s largest dataset of sketches. At the
time of this writing, the Quick, Draw! dataset consists of 50 million drawings across 345
categories. Example drawings from 12 of these categories are provided in Figure 12.4,
including from the categories of ant, anvil, and apple. The GAN we’ll build in this chapter
will be trained on images from the apple category, but you’re welcome to choose any
category you fancy. You could even train on several categories simultaneously if you’re
feeling adventurous! 5

The GitHub repository of the Quick, Draw! game dataset can be accessed via
bit.ly/QDrepository. The data are available in several formats there, including as
raw and unmoderated images. In the interest of having relatively uniform data, we
recommend using preprocessed data, which are centered and scaled doodles, among other
more-technical adjustments. Specifically, for the simplicity of working with the data in
Python, we recommend selecting the NumPy-formatted bitmaps of the preprocessed
data. 6

We downloaded the apple.npy file, but you could pick any category that you
desire for your own GAN. The contents of our Jupyter working directory are shown in
Figure 12.5 with the data file stored here:

/deep-learning-illustrated/quickdraw_data/apple.npy


4. quickdraw.withgoogle.com 

5. If you have a lot of compute resources available to you (we’d recommend multiple GPUs), you could train a
GAN on the data from all 345 sketch categories simultaneously. We haven’t tested this, so it really would be an
adventure.

6. These particular data are available at bit.ly/QDdata for you to download.

Figure 12.5 The directory structure inside the Docker container that is running
Jupyter. We put our quickdraw_data directory (for storing Quick, Draw! NumPy bitmaps)
at the same level as our notebooks directory (which contains all of the Jupyter notebooks
we’ve been running in this book).


You’re welcome to store the data elsewhere, and you’re welcome to change the
filename (especially if you downloaded a category other than apples); if you do, however, be
mindful that you’ll need to update your data-loading code (coming up in Example 12.2)
accordingly.

The first step, as you should be used to by now, is to load the package dependencies. 
For our Generative Adversarial Network notebook, these dependencies are provided in
Example 12.1. 7 

Example 12.1 Generative adversarial network dependencies 

# for data input and output:
import numpy as np
import os

# for deep learning:
import keras
from keras.models import Model
from keras.layers import Input, Dense, Conv2D, Dropout
from keras.layers import BatchNormalization, Flatten
from keras.layers import Activation
from keras.layers import Reshape # new!
from keras.layers import Conv2DTranspose, UpSampling2D # new!
from keras.optimizers import RMSprop # new!

# for plotting:
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline


7. Our GAN architecture is based on Rowel Atienza’s, which you can check out in GitHub via bit.ly/mnistGAN.

All of these dependencies have popped up previously in this book except for three new
layers and the RMSProp optimizer, 8 which we’ll go over as we design our model archi- 
tecture. 

Okay, now back to loading the data. Assuming you set up your directory structure the
same as ours and downloaded the apple.npy file, you can load these data in using the
command in Example 12.2.

Example 12.2 Loading the Quick, Draw! data 

input_images = "../quickdraw_data/apple.npy" 
data = np.load(input_images)

Again, if your directory structure is different from ours or you selected a different
category of NumPy images from the Quick, Draw! dataset, then you’ll need to amend the
input_images path variable to your particular circumstance.

Running data.shape outputs the two dimensions of your training data. The first
dimension is the number of images. At the time of this writing, the apples category had
145,000 images, but there are likely to be more by the time you’re reading this. The
second dimension is the pixel count of each image. This value is 784, which should be
familiar because, like the MNIST digits, these images have a 28×28-pixel shape.

Not only do these Quick, Draw! images have the same dimensions as the MNIST
digits, but they are also represented as 8-bit integers, that is, integers ranging from 0 to 255.
You can examine one (say, the 4,243rd image) by executing data[4242]. Because the
data are still in a one-dimensional array, this doesn’t show you much. You should reformat
the data as follows:

data = data/255
data = np.reshape(data, (data.shape[0], 28, 28, 1))
img_w, img_h = data.shape[1:3]

Let’s examine this code line by line: 

■ We divide by 255 to scale our pixels to be in the range of 0 to 1, just as we did for 
the MNIST digits. 9 


8. We introduced RMSProp in Chapter 9. Skip back to the section “Fancy Optimizers” if you’d like a refresher.

9. See the footnote near Example 5.4 for an explanation as to why we scale in this way.

Figure 12.6 This example bitmap is the 4,243rd sketch from the apple category of the
Quick, Draw! dataset.



■ The first hidden layer of our discriminator network will consist of two-dimensional
convolutional filters, so we convert the images from 1 × 784-pixel arrays to
28 × 28-pixel matrices. The NumPy reshape() method does this for us. Note that the
fourth dimension is 1 because the images are monochromatic; it would be 3 if the
images were full-color.

■ We store the image width (img_w) and height (img_h) for use later.
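
If you don’t have the apple file to hand, the same three steps can be tried on stand-in data. The sketch below fabricates five random “images” of 784 8-bit values (an assumption made only so that the example runs anywhere) and applies the scaling and reshaping just described.

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for the Quick, Draw! bitmaps: 5 images, 784 pixels, values 0-255
data = rng.integers(0, 256, size=(5, 784), dtype=np.uint8)

data = data / 255                                     # scale pixels to [0, 1]
data = np.reshape(data, (data.shape[0], 28, 28, 1))   # images, rows, cols, channels
img_w, img_h = data.shape[1:3]

print(data.shape, img_w, img_h)
```

With the real file, only the first dimension differs: it equals the number of sketches in the category you downloaded.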

Figure 12.6 provides an example of what our reformatted data look like. We printed
that example (a bitmap of the 4,243rd sketch from the apple category) by running this
code:

plt.imshow(data[4242,:,:,0], cmap='Greys')


The Discriminator Network 


Our discriminator is a fairly straightforward convolutional network, involving the Conv2D 
layers detailed in Chapter 10 and the Model class introduced at the end of Chapter 11. 

See the code in Example 12.3. 

Example 12.3 Discriminator model architecture 

def build_discriminator(depth=64, p=0.4):

    # Define inputs
    image = Input((img_w, img_h, 1))

    # Convolutional layers
    conv1 = Conv2D(depth*1, 5, strides=2,
                   padding='same', activation='relu')(image)
    conv1 = Dropout(p)(conv1)

    conv2 = Conv2D(depth*2, 5, strides=2,
                   padding='same', activation='relu')(conv1)
    conv2 = Dropout(p)(conv2)

    conv3 = Conv2D(depth*4, 5, strides=2,
                   padding='same', activation='relu')(conv2)
    conv3 = Dropout(p)(conv3)

    conv4 = Conv2D(depth*8, 5, strides=1,
                   padding='same', activation='relu')(conv3)
    conv4 = Flatten()(Dropout(p)(conv4))

    # Output layer
    prediction = Dense(1, activation='sigmoid')(conv4)

    # Model definition
    model = Model(inputs=image, outputs=prediction)
    return model

For the first time in this book, rather than create a model architecture directly
we instead define a function (build_discriminator) that returns the constructed
model object. Considering the schematic of this model in Figure 12.7 and the code in
Example 12.3, let’s break down each piece of the model:

■ The input images are 28 × 28 pixels in size. This is passed to the input layer by the
variables img_w and img_h.

■ There are four hidden layers, and all of them are convolutional.

■ The number of convolutional filters per layer doubles layer-by-layer such that the
first hidden layer has 64 convolutional filters (and therefore outputs an activation
map with a depth of 64), whereas the fourth hidden layer has 512 convolutional
filters (corresponding to an activation map with a depth of 512). 10

■ The filter size is held constant at 5 × 5. 11

■ The stride length for the first three convolutional layers is 2 × 2, which means that
the activation map’s height and width are roughly halved by each of these layers
(recall Equation 10.3). The stride length for the last convolutional layer is 1 × 1, so
the activation map it outputs has the same height and width as the activation map
input into it (4 × 4).

■ Dropout of 40 percent (p=0.4) is applied to every convolutional layer. 


10. More filters lead to more parameters and more model complexity, but also contribute to greater sharpness in
the images the GANs produce. These values work well enough for this example.

11. We’ve largely used a filter size of 3 × 3 thus far in the book, although GANs can benefit from a slightly larger
filter size, especially earlier in the network.

Figure 12.7 A schematic representation of our discriminator network for predicting
whether an input image is real (in this case, a hand-drawn apple from the Quick, Draw!
dataset) or fake (produced by an image generator)


■ We flatten the three-dimensional activation map from the final convolutional layer
so that we can feed it into the dense output layer.

■ As with the film sentiment models in Chapter 11, discriminating real images from 
fakes is a binary classification task, so our (dense) output layer consists of a single 
sigmoid neuron. 
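The shape bookkeeping in the list above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the Keras rule that a convolution with 'same' padding outputs ceil(input size / stride) along each spatial axis; the helper name same_padding_output_size is hypothetical.

```python
import math

def same_padding_output_size(size, stride):
    """Spatial output size of a convolution that uses 'same' padding."""
    return math.ceil(size / stride)

size = 28                 # input images are 28 x 28
depth = 64                # filters in the first hidden layer
strides = [2, 2, 2, 1]    # strides of the four convolutional layers

shapes = []
for i, stride in enumerate(strides):
    size = same_padding_output_size(size, stride)
    filters = depth * 2 ** i          # filter count doubles layer by layer
    shapes.append((size, size, filters))
    print(f"conv{i + 1}: {size} x {size} x {filters}")

# Length of the vector produced by flattening the final activation map.
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print("flattened length:", flattened)   # flattened length: 8192
```

The trace runs 28 to 14 to 7 to 4 to 4, which is why the text describes the halving as rough: 7 does not divide evenly by 2.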

To build the discriminator, we call our build_discriminator function without any
arguments:

discriminator = build_discriminator()

A summary of the model architecture can be output by calling the model’s summary
method, which shows that the model has a total of 4.3 million parameters, most of which
(76 percent) are associated with the final convolutional layer.
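The parameter totals quoted above can be reproduced by hand from the layer sizes in Example 12.3. The sketch below assumes the standard counting rule (one weight per kernel position, input channel, and filter, plus one bias per filter); the helper names are hypothetical.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a dense layer."""
    return n_inputs * n_units + n_units

# Channel depth entering and leaving each of the four convolutional layers.
channels = [1, 64, 128, 256, 512]
conv_counts = [conv2d_params(5, channels[i], channels[i + 1])
               for i in range(4)]

# The flattened 4 x 4 x 512 activation map feeds a single sigmoid neuron.
output_count = dense_params(4 * 4 * 512, 1)

total = sum(conv_counts) + output_count
share_last_conv = conv_counts[-1] / total

print(total)                         # 4311553
print(round(100 * share_last_conv))  # 76
```

Dropout and Flatten layers contribute no parameters, so they do not appear in the count.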

Example 12.4 provides code for compiling the discriminator. 

Example 12.4 Compiling the discriminator network 

discriminator.compile(loss='binary_crossentropy',
                      optimizer=RMSprop(lr=0.0008,
                                        decay=6e-8,
                                        clipvalue=1.0),
                      metrics=['accuracy'])

Let’s look at Example 12.4 line by line: 

■ As in Chapter 11, we use the binary cross-entropy cost function because the
discriminator is a binary classification model.

■ Introduced in Chapter 9, RMSprop is an alternative “fancy optimizer” to Adam. 12

■ The decay rate (decay, ρ) for the RMSprop optimizer is a hyperparameter
described in Chapter 9.

■ Finally, clipvalue is a hyperparameter that prevents (i.e., clips) the gradient of
learning (the partial-derivative relationship between cost and parameter values
during stochastic gradient descent) from exceeding this value; clipvalue thereby
explicitly limits exploding gradients (see Chapter 9). This particular value of 1.0 is
common.
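To make the effect of clipvalue concrete, here is a minimal sketch of element-wise gradient clipping in plain Python. The function name clip_by_value and the sample gradients are hypothetical; the sketch only mirrors the idea that each gradient component is limited to the range from -clipvalue to +clipvalue.

```python
def clip_by_value(gradients, clipvalue=1.0):
    """Limit every gradient component to the interval [-clipvalue, clipvalue]."""
    return [max(-clipvalue, min(clipvalue, g)) for g in gradients]

# Hypothetical gradient components, two of which are "exploding".
grads = [0.3, -2.5, 7.0, -0.9]
clipped = clip_by_value(grads)

print(clipped)   # [0.3, -1.0, 1.0, -0.9]
```

Components already inside the interval pass through unchanged, so clipping only intervenes on unusually large updates.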


The Generator Network 


Although the CNN architecture of the discriminator network should largely look
familiar, the generator network contains a number of aspects that you haven’t
encountered previously in this book. The generator model is shown schematically in Figure 12.8.

We refer to the generator as a deCNN because it features de-convolutional layers (also
known as convTranspose layers) that perform the opposite function of the typical
convolutional layers you’ve encountered so far. Instead of detecting features and outputting an
activation map of where the features occur in an image, de-convolutional layers take in an
activation map and arrange the features spatially as outputs. An early step in the generative
network reshapes the noise input (a one-dimensional vector) into a two-dimensional array
that can be used by the de-convolutional layers. Through several layers of de-convolution,
the generator converts the random noise inputs into fake images.


12. Ian Goodfellow and his colleagues published the first GAN paper in 2014. At the time, RMSProp was an
optimizer already in vogue (the researchers Kingma and Ba published on Adam in 2014 as well, and it has become
more popular in the years since). You might need to tune the hyperparameters a bit, but you could probably
substitute RMSProp with Adam to similar effect.

Figure 12.8 A schematic representation of our generator network, which takes in noise
(in this case, representing 32 latent-space dimensions) and outputs a 28 × 28-pixel
image. After training as part of an adversarial network, these images should resemble
images from the training dataset (in this case, hand-drawn apples).




The code to build the generator model is in Example 12.5. 

Example 12.5 Generator model architecture 

z_dimensions = 32

def build_generator(latent_dim=z_dimensions,
                    depth=64, p=0.4):

    # Define inputs
    noise = Input((latent_dim,))

    # First dense layer
    dense1 = Dense(7*7*depth)(noise)
    dense1 = BatchNormalization(momentum=0.9)(dense1)
    dense1 = Activation(activation='relu')(dense1)
    dense1 = Reshape((7,7,depth))(dense1)
    dense1 = Dropout(p)(dense1)

    # De-Convolutional layers
    conv1 = UpSampling2D()(dense1)
    conv1 = Conv2DTranspose(int(depth/2),
                            kernel_size=5, padding='same',
                            activation=None,)(conv1)
    conv1 = BatchNormalization(momentum=0.9)(conv1)
    conv1 = Activation(activation='relu')(conv1)

    conv2 = UpSampling2D()(conv1)
    conv2 = Conv2DTranspose(int(depth/4),
                            kernel_size=5, padding='same',
                            activation=None,)(conv2)
    conv2 = BatchNormalization(momentum=0.9)(conv2)
    conv2 = Activation(activation='relu')(conv2)

    conv3 = Conv2DTranspose(int(depth/8),
                            kernel_size=5, padding='same',
                            activation=None,)(conv2)
    conv3 = BatchNormalization(momentum=0.9)(conv3)
    conv3 = Activation(activation='relu')(conv3)

    # Output layer
    image = Conv2D(1, kernel_size=5, padding='same',
                   activation='sigmoid')(conv3)

    # Model definition
    model = Model(inputs=noise, outputs=image)
    return model

Let’s go through the architecture in detail: 

■ We specify the number of dimensions in the input noise vector (z_dimensions)
as 32. Configuring this hyperparameter follows the same advice we gave for
selecting the number of dimensions in word-vector space in Chapter 11: A
higher-dimensional noise vector has the capacity to store more information and thus can
improve the quality of the GAN’s fake-image output; however, this comes at the
cost of increased computational complexity. Generally, we recommend
experimenting with varying this hyperparameter by multiples of 2.

■ As with our discriminator model architecture (Example 12.3), we again opted to
wrap our generator architecture within a function.

■ The input is the random noise array with a length corresponding to latent_dim, 
which in this case is 32. 

■ The first hidden layer is a dense layer. This fully connected layer enables the
latent-space input to be flexibly mapped to the spatial (de-convolutional) hidden layers
that follow. The 32 input dimensions are mapped to 3,136 neurons in the dense
layer, which outputs a one-dimensional array of activations. These activations are
then reshaped into a 7 × 7 × 64 activation map. This dense layer is the only layer in
the generator where dropout is applied.

■ The network has three de-convolutional layers (specified by Conv2DTranspose).
The first has 32 filters, and this number is halved successively in the remaining
two layers. 13 While the number of filters decreases, the size of the activation maps
increases, thanks to the upsampling layers (UpSampling2D). Each time upsampling is applied
(with its default parameters, as we use it here), both the height and the width of the
activation map double. 14 All three de-convolutional layers have the following:

■ 5 × 5 filter sizes

■ Stride of 1 × 1 (the default)

■ Padding set to same to maintain the dimensions of the activation maps after 
de-convolution 

■ ReLU activation functions 

■ Batch normalization applied (to promote regularization) 

■ The output layer is a convolutional layer that collapses the 28 × 28 × 8 activation
maps into a single 28 × 28 × 1 image. The sigmoid activation function in this last step
ensures that the pixel values range from 0 to 1, just like the data from real images
that we feed into the discriminator separately.
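The doubling performed by each upsampling layer can be illustrated with NumPy. This is a minimal sketch, assuming nearest-neighbor behavior with a 2 × 2 factor (the UpSampling2D defaults); the function name upsample_2x is hypothetical and the code is not the Keras implementation itself.

```python
import numpy as np

def upsample_2x(activation_map):
    """Nearest-neighbor upsampling: repeat every row and every column twice."""
    doubled_rows = np.repeat(activation_map, 2, axis=0)
    return np.repeat(doubled_rows, 2, axis=1)

small = np.array([[1, 2],
                  [3, 4]])
large = upsample_2x(small)
print(large)
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]

# Applied to a 7 x 7 x 64 activation map, height and width double
# while the depth is left untouched.
print(upsample_2x(np.zeros((7, 7, 64))).shape)   # (14, 14, 64)
```

Two such steps take the generator from 7 × 7 to 14 × 14 and then to 28 × 28, matching the size of the real images.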

Exactly as we did with the discriminator network, we call the build_generator function
without supplying any arguments to build the generator:

generator = build_generator()

Calling the model’s summary method shows that the generator has only 177,000 trainable
parameters, a mere 4 percent of the number of parameters in the discriminator.
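The trainable-parameter figure can be cross-checked from the layer sizes in Example 12.5. The sketch below assumes the usual counting rules: weights plus biases for dense and (transposed) convolutional layers, and one trainable scale and one trainable shift per feature for batch normalization. The helper names are hypothetical, and the discriminator total is the rounded 4.3 million quoted in the text.

```python
def dense_params(n_inputs, n_units):
    return n_inputs * n_units + n_units

def conv_params(kernel, in_channels, filters):
    # Same count for Conv2D and Conv2DTranspose with a square kernel.
    return kernel * kernel * in_channels * filters + filters

def batchnorm_trainable(n_features):
    # One scale (gamma) and one shift (beta) per feature.
    return 2 * n_features

latent_dim, depth = 32, 64
counts = {
    "dense1": dense_params(latent_dim, 7 * 7 * depth),
    "bn_dense1": batchnorm_trainable(7 * 7 * depth),
    "conv1": conv_params(5, depth, depth // 2),
    "bn_conv1": batchnorm_trainable(depth // 2),
    "conv2": conv_params(5, depth // 2, depth // 4),
    "bn_conv2": batchnorm_trainable(depth // 4),
    "conv3": conv_params(5, depth // 4, depth // 8),
    "bn_conv3": batchnorm_trainable(depth // 8),
    "output": conv_params(5, depth // 8, 1),
}
generator_total = sum(counts.values())
discriminator_total = 4_300_000   # rounded figure quoted in the text

print(generator_total)                                     # 177329
print(round(100 * generator_total / discriminator_total))  # 4
```

More than half of the generator's parameters sit in the first dense layer, which maps the 32 noise dimensions onto 3,136 neurons.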

The Adversarial Network 


Combining the training processes from Figures 12.2 and 12.3, we arrive at the outline in 
Figure 12.9. By executing the code examples so far in this chapter, we have accomplished 
the following: 


13. As with convolutional layers, the number of filters in the layer corresponds to the number of slices (the depth)
of the activation map the layer outputs.

14. This makes upsampling roughly the inverse of pooling.

Figure 12.9 Shown here is a summary of the whole adversarial network. The horizontal
dashes visually separate generator training from discriminator training. Green lines
indicate trainable paths, whereas inference-only paths are in black. The red arrows above
and below indicate the path of the backpropagation step during the respective training
processes.


■ With respect to discriminator training (Figure 12.2), we’ve constructed our
discriminator network and compiled it: It’s ready to be trained on real and fake images so
that it can learn how to distinguish between these two classes.

■ With respect to generator training (Figure 12.3), we’ve constructed our generator 
network, but it needs to be compiled as part of the larger adversarial network in 
order to be ready for training. 

To combine our generator and discriminator networks to build an adversarial network, 
we use the code in Example 12.6. 

Example 12.6 Adversarial model architecture 

z = Input(shape=(z_dimensions,))
img = generator(z)
discriminator.trainable = False
pred = discriminator(img)
adversarial_model = Model(z, pred)

Let’s break this code down:

■ We use Input() to define the model’s input z, which will be an array of random
noise of length 32.

■ Passing z into generator returns a 28 × 28 image output that we call img.

■ For the purposes of generator training, the parameters of our discriminator network
must be frozen (see Figure 12.3), so we set the discriminator’s trainable attribute
to False.

■ We pass the fake img into the frozen discriminator network, which outputs a
prediction (pred) as to whether the image is real or fake.

■ Finally, using the Keras functional API’s Model class, we construct the adversarial
model. By indicating that the adversarial model’s input is z and its output is pred,
the functional API determines that the adversarial network consists of the generator
passing img into the frozen discriminator.

To compile the adversarial network, we use the code in Example 12.7. 

Example 12.7 Compiling the adversarial network

adversarial_model.compile(loss='binary_crossentropy',
                          optimizer=RMSprop(lr=0.0004,
                                            decay=3e-8,
                                            clipvalue=1.0),
                          metrics=['accuracy'])

The arguments to the compile() method are the same as those we used for the
discriminator network (see Example 12.4), except that the optimizer’s learning rate and decay
have been halved. There’s a somewhat delicate balance to be struck between the rate
at which the discriminator and the generator learn in order for the GAN to produce
compelling fake images. If you were to adjust the optimizer hyperparameters of the
discriminator model when compiling it, then you might find that you’d also need to adjust
them for the adversarial model in order to produce satisfactory image outputs.



A tricky aspect of the GAN training process that is worth restating is that the same
discriminator network parameters (weights) are used during discriminator training and
during adversarial training. The discriminator is not frozen across the board; it is only
frozen when it’s a component of the adversarial model. In this way, during
discriminator training the weights are updated during backpropagation and the model learns to
distinguish between real and fake images. The adversarial model, in contrast, was
compiled with a frozen discriminator. This discriminator is the exact same model with the
same weights, but when the adversarial model learns it does not update the discriminator
weights; it only updates the weights of the generator.
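The idea of training the generator through a frozen discriminator can be shown with a deliberately tiny toy, far simpler than the Keras models in this chapter. The sketch assumes a one-parameter generator G(z) = theta * z and a one-parameter discriminator D(x) = sigmoid(w * x); all names and numbers are hypothetical. The fake is labeled as real (y = 1), so the cross-entropy cost reduces to -log D(G(z)), and only theta is updated.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def adversarial_step(theta, w, z, lr=0.5):
    """One generator update through a frozen discriminator.

    The discriminator weight w is read but never modified here.
    """
    x_fake = theta * z                  # generator output
    d_out = sigmoid(w * x_fake)         # discriminator's "real" probability
    loss = -math.log(d_out)             # cross-entropy with label y = 1
    # Chain rule: dloss/dtheta = -(1 - d_out) * w * z
    grad_theta = -(1.0 - d_out) * w * z
    return theta - lr * grad_theta, loss

theta, w, z = 0.1, 2.0, 1.0
losses = []
for _ in range(20):
    theta, loss = adversarial_step(theta, w, z)
    losses.append(loss)

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, w is still {w}")
```

The loss falls because the generator parameter moves toward outputs that the unchanged discriminator scores as real, which is the same division of labor the adversarial model enforces.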
GAN Training

To train our GAN, we call the (cleverly titled) function train, which is provided in
Example 12.8.

Example 12.8 GAN training 

def train(epochs=2000, batch=128, z_dim=z_dimensions):

    d_metrics = []
    a_metrics = []

    running_d_loss = 0
    running_d_acc = 0
    running_a_loss = 0
    running_a_acc = 0

    for i in range(epochs):

        # sample real images:
        real_imgs = np.reshape(
            data[np.random.choice(data.shape[0],
                                  batch,
                                  replace=False)],
            (batch, 28, 28, 1))

        # generate fake images:
        fake_imgs = generator.predict(
            np.random.uniform(-1.0, 1.0,
                              size=[batch, z_dim]))

        # concatenate images as discriminator inputs:
        x = np.concatenate((real_imgs, fake_imgs))

        # assign y labels for discriminator:
        y = np.ones([2*batch, 1])
        y[batch:, :] = 0

        # train discriminator:
        d_metrics.append(
            discriminator.train_on_batch(x, y)
        )
        running_d_loss += d_metrics[-1][0]
        running_d_acc += d_metrics[-1][1]

        # adversarial net's noise input and "real" y:
        noise = np.random.uniform(-1.0, 1.0,
                                  size=[batch, z_dim])
        y = np.ones([batch, 1])

        # train adversarial net:
        a_metrics.append(
            adversarial_model.train_on_batch(noise, y)
        )
        running_a_loss += a_metrics[-1][0]
        running_a_acc += a_metrics[-1][1]

        # periodically print progress & fake images:
        if (i+1)%100 == 0:

            print('Epoch #{}'.format(i))
            log_mesg = "%d: [D loss: %f, acc: %f]" % \
                (i, running_d_loss/i, running_d_acc/i)
            log_mesg = "%s [A loss: %f, acc: %f]" % \
                (log_mesg, running_a_loss/i, running_a_acc/i)
            print(log_mesg)

            noise = np.random.uniform(-1.0, 1.0,
                                      size=[16, z_dim])
            gen_imgs = generator.predict(noise)

            plt.figure(figsize=(5,5))

            for k in range(gen_imgs.shape[0]):
                plt.subplot(4, 4, k+1)
                plt.imshow(gen_imgs[k, :, :, 0],
                           cmap='gray')
                plt.axis('off')

            plt.tight_layout()
            plt.show()

    return a_metrics, d_metrics

# train the GAN:
a_metrics_complete, d_metrics_complete = train()


This is the largest single chunk of code in the book, so from top to bottom, let’s dissect it
to understand it better:

■ The two empty lists (e.g., d_metrics) and the four variables set to 0 (e.g.,
running_d_loss) are for tracking loss and accuracy metrics for the discriminator
(d) and adversarial (a) networks as they train.

■ We use the for loop to train for however many epochs we’d like. Note that while
the term epoch is commonly used by GAN developers for this loop, it would be
more accurate to call it a batch: During each iteration of the for loop, we will
sample only 128 apple sketches from our dataset of hundreds of thousands of such
sketches.

■ Within each epoch, we alternate between discriminator training and generator 
training. 

■ To train the discriminator (as depicted in Figure 12.2), we: 

■ Sample a batch of 128 real images. 

■ Generate 128 fake images by creating noise vectors (z, sampled uniformly
over the range [-1.0, 1.0]) and passing them into the generator model’s
predict method. Note that by using the predict method, the generator is
only performing inference; it is generating images without updating any of its
parameters.

■ Concatenate the real and fake images into a single variable x, which will serve 
as the input into our discriminator. 

■ Create an array, y, to label the images as real ( y = 1) or fake ( y = 0) so that 
they can be used to train the discriminator. 

■ To train the discriminator, we pass our inputs x and labels y into the
model’s train_on_batch method.

■ After each round of training, the training loss and accuracy metrics are
appended to the d_metrics list.

■ To train the generator (as in Figure 12.3), we: 

■ Pass random noise vectors (stored in a variable called noise) as inputs as well
as an array (y) of all-real labels (i.e., y = 1) into the train_on_batch method
of the adversarial model.

■ The generator component of the adversarial model converts the noise inputs
into fake images, which are automatically passed as inputs into the
discriminator component of the adversarial model.

■ Because the discriminator’s parameters are frozen during adversarial model
training, the discriminator will simply tell us whether it thinks the incoming
images are real or fake. Even though the generator outputs fakes, they are
labeled as real (y = 1) and the cross-entropy cost is used during
backpropagation to update the weights of the generator model. By minimizing this
cost, the generator should learn to produce fake images that the discriminator
erroneously classifies as real.

■ After each round of training, the adversarial loss and accuracy metrics are
appended to the a_metrics list.

■ After every 100 epochs we:

■ Print the epoch that we are in.

■ Print a log message that includes the discriminator and adversarial models’
running loss and accuracy metrics.

■ Randomly sample 16 noise vectors and use the generator’s predict method
to generate fake images, which are stored in gen_imgs.

■ Plot the 16 fake images in a 4 × 4 grid so that we can monitor the quality of
the generator’s images during training.

■ At the conclusion of the train function, we return the lists of adversarial model
and discriminator model metrics (a_metrics and d_metrics, respectively).

■ Finally, we call the train function, saving the metrics into the
a_metrics_complete and d_metrics_complete variables as training progresses.
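The labeling scheme described in the list above is easy to inspect in isolation. This is a minimal sketch, assuming constant arrays stand in for the real and generated images and a batch size of 4 replaces the 128 used in Example 12.8; no model is trained here.

```python
import numpy as np

batch = 4  # small stand-in for the batch size of 128 used in the text

# Stand-ins for real and generated images (hypothetical constant arrays).
real_imgs = np.ones((batch, 28, 28, 1))
fake_imgs = np.zeros((batch, 28, 28, 1))

# Discriminator batch: real images first, fake images second.
x = np.concatenate((real_imgs, fake_imgs))
y = np.ones([2 * batch, 1])
y[batch:, :] = 0              # the second half of the batch is fake

# Adversarial batch: every generated image is labeled as real.
y_adv = np.ones([batch, 1])

print(x.shape)                # (8, 28, 28, 1)
print(y.ravel())              # [1. 1. 1. 1. 0. 0. 0. 0.]
print(y_adv.ravel())          # [1. 1. 1. 1.]
```

The discriminator therefore sees honest labels, while the adversarial model is deliberately fed all-real labels so that the generator is rewarded for fooling the discriminator.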

After 100 rounds (epochs) of training (see Figure 12.10), our GAN’s fake images
appear to have some vague sketch-like structure, but we can’t yet discern apples in them.
After 200 rounds, however (see Figure 12.11), the images do begin to have a loose
appleyness to them. Over several hundred more rounds of training, the GAN begins to produce
some compelling forgeries of apple sketches (Figure 12.12). And, after 2,000 rounds, our
GAN output the “machine art” demo images that we provided way back at the end of
Chapter 3 (Figure 3.9).

To wrap up our Generative Adversarial Network notebook, we ran the code in
Examples 12.9 and 12.10 to create plots of our GAN’s training loss (Figure 12.13) and
training accuracy (Figure 12.14). These show that the adversarial model’s loss declined as
the quality of the apple-sketch forgeries improved; that is what we’d expect because this
model’s loss is associated with fake images being misclassified as real ones by the
discriminator network, and you can see from Figures 12.10, 12.11, and 12.12 that, the longer
we trained, the more real the fakes appeared. As the generator component of the
adversarial model began to produce higher-quality fakes, the discriminator’s task of
discerning real apple sketches from fake ones became more difficult, and so its loss generally
rose over the first 300 epochs. From the ~300th epoch onward, the discriminator
modestly improved at its binary classification task, corresponding to a gentle decrease in its
training loss and an increase in its training accuracy.

Figure 12.10 Fake apple sketches generated after 100 epochs of training our GAN

Figure 12.11 Fake apple sketches after 200 epochs of training our GAN

Figure 12.12 Fake apple sketches after 1,000 epochs of training our GAN

Figure 12.13 GAN training loss over epochs

Figure 12.14 GAN training accuracy over epochs

Example 12.9 Plotting our GAN training loss 

ax = pd.DataFrame(
    {
        'Adversarial': [metric[0] for metric in a_metrics_complete],
        'Discriminator': [metric[0] for metric in d_metrics_complete],
    }
).plot(title='Training Loss', logy=True)
ax.set_xlabel("Epochs")
ax.set_ylabel("Loss")


Example 12.10 Plotting our GAN training accuracy 

ax = pd.DataFrame(
    {
        'Adversarial': [metric[1] for metric in a_metrics_complete],
        'Discriminator': [metric[1] for metric in d_metrics_complete],
    }
).plot(title='Training Accuracy')
ax.set_xlabel("Epochs")
ax.set_ylabel("Accuracy")


Summary 

In this chapter, we covered the essential theory of GANs, including a couple of new layer
types (de-convolution and upsampling). We constructed discriminator and generator
networks and then combined them to form an adversarial network. Through alternately
training a discriminator model and the generator component of the adversarial model, the
GAN learned how to create novel “sketches” of apples.

Key Concepts

Here are the essential foundational concepts thus far. New terms from the current
chapter are highlighted in purple.

■ parameters:
  ■ weight w
  ■ bias b
■ activation a
■ artificial neurons:
  ■ sigmoid
  ■ tanh
  ■ ReLU
  ■ linear
■ input layer
■ hidden layer
■ output layer
■ layer types:
  ■ dense (fully connected)
  ■ softmax
  ■ convolutional
  ■ de-convolutional
  ■ max-pooling
  ■ upsampling
  ■ flatten
  ■ embedding
  ■ RNN
  ■ (bidirectional-)LSTM
  ■ concatenate
■ cost (loss) functions:
  ■ quadratic (mean squared error)
  ■ cross-entropy
■ forward propagation
■ backpropagation
■ unstable (especially vanishing) gradients
■ Glorot weight initialization
■ batch normalization
■ dropout
■ optimizers:
  ■ stochastic gradient descent
  ■ Adam
■ optimizer hyperparameters:
  ■ learning rate η
  ■ batch size
■ word2vec
■ GAN components:
  ■ discriminator network
  ■ generator network
  ■ adversarial network

Deep Reinforcement Learning


In Chapter 4, we introduced the paradigm of reinforcement learning (as distinct from
supervised and unsupervised learning), in which an agent (e.g., an algorithm) takes
sequential actions within an environment. The environments, whether they be simulated
or real world, can be extremely complex and rapidly changing, requiring
sophisticated agents that can adapt appropriately in order to succeed at fulfilling their objective.
Today, many of the most prolific reinforcement learning agents involve an artificial neural
network, making them deep reinforcement learning algorithms.

In this chapter, we will 

■ Cover the essential theory of reinforcement learning in general and, in particular, a 
deep reinforcement learning model called deep Q-learning 

■ Use Keras to construct a deep Q-learning network that learns how to excel within
simulated, video game environments

■ Discuss approaches for optimizing the performance of deep reinforcement learning 
agents 

■ Introduce families of deep RL agents beyond deep Q-learning 


Essential Theory of Reinforcement Learning 

Recall from Chapter 4 (specifically, Figure 4.3) that reinforcement learning is a machine 
learning paradigm involving: 

■ An agent taking an action within an environment (let’s say the action is taken at some
timestep t).

■ The environment returning two types of information to the agent:

1. Reward: This is a scalar value that provides quantitative feedback on the action
that the agent took at timestep t. This could, for example, be 100 points
as a reward for acquiring cherries in the video game Pac-Man. The agent’s
objective is to maximize the rewards it accumulates, and so rewards are what
reinforce productive behaviors that the agent discovers under particular
environmental conditions.

2. State: This is how the environment changes in response to an agent’s action.
During the forthcoming timestep (t + 1), these will be the conditions for the
agent to choose an action in.

■ Repeating the above two steps in a loop until reaching some terminal state. This 
terminal state could be reached by, for example, attaining the maximum possible 
reward, attaining some specific desired outcome (such as a self-driving car reaching 
its programmed destination), running out of allotted time, using up the maximum 
number of permitted moves in a game, or the agent dying in a game. 
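The loop described above (act, receive a reward and a new state, repeat until a terminal state) can be sketched in a few lines. The environment below is a hypothetical toy written for illustration, not part of any library: its state is the number of moves remaining, and it pays a reward of 1 whenever the agent chooses "right".

```python
import random

class CountdownEnv:
    """Hypothetical toy environment: the state is the number of moves left."""

    def __init__(self, max_moves=5):
        self.max_moves = max_moves
        self.moves_left = max_moves

    def reset(self):
        self.moves_left = self.max_moves
        return self.moves_left

    def step(self, action):
        # Arbitrary reward rule for illustration: "right" earns 1 point.
        reward = 1 if action == "right" else 0
        self.moves_left -= 1
        done = self.moves_left == 0      # terminal state: out of moves
        return self.moves_left, reward, done

def run_episode(env, policy):
    """Run the agent-environment loop until a terminal state is reached."""
    state = env.reset()
    total_reward, done = 0, False
    while not done:
        action = policy(state)                    # agent acts at timestep t
        state, reward, done = env.step(action)    # environment responds
        total_reward += reward
    return total_reward

def always_right(state):
    return "right"

def random_policy(state):
    return random.choice(["left", "right"])

print(run_episode(CountdownEnv(), always_right))   # 5
print(run_episode(CountdownEnv(), random_policy))  # somewhere from 0 to 5
```

The policy functions receive only the current state, which is all an agent has to go on when it picks its next action.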

Reinforcement learning problems are sequential decision-making problems. In
Chapter 4, we discussed a number of particular examples of these, including:

■ Atari video games, such as Pac-Man, Pong, and Breakout 

■ Autonomous vehicles, such as self-driving cars and aerial drones 

■ Board games, such as Go, chess, and shogi 

■ Robot-arm manipulation tasks, such as removing a nail with a hammer 

The Cart-Pole Game 

In this chapter, we will use OpenAI Gym—a popular library of reinforcement learning 
environments (examples provided in Figure 4.13)—to train an agent to play Cart-Pole, a 
classic problem among academics working in the field of control theory. In the Cart-Pole 
game: 

■ The objective is to balance a pole on top of a cart. The pole is connected to the 
cart at a purple dot, which functions as a pin that permits the pole to rotate along 
the horizontal axis, as illustrated in Figure 13.1. 1 

■ The cart itself can only move horizontally, either to the left or to the right. At any 
given moment—at any given timestep —the cart must be moved to the left or to the 
right; it can’t remain stationary. 

■ Each episode of the game begins with the cart positioned at a random point near 
the center of the screen and with the pole at a random angle near vertical. 

■ As shown in Figure 13.2, an episode ends when either 

■ The pole is no longer balanced on the cart—that is, when the angle of the 
pole moves too far away from vertical toward horizontal 

■ The cart touches the boundaries—the far right or far left of the screen 

■ In the version of the game that you’ll play in this chapter, the maximum number
of timesteps in an episode is 200. So, if the episode does not end early (due to
losing pole balance or navigating off the screen), then the game will end after 200
timesteps.

■ One point of reward is provided for every timestep that the episode lasts, so the 
maximum possible reward is 200 points. 


1. An actual screen capture of the Cart-Pole game is provided in Figure 4.13a.

Figure 13.1 The objective of the Cart-Pole game is to keep the brown pole balanced
upright on top of the black cart for as long as possible. The player of the game (be it a
human or a machine) controls the cart by moving it horizontally to the left or to the right
along the black line. The pole moves freely along the axis created by the purple pin.

Figure 13.2 The Cart-Pole game ends early if the pole falls toward horizontal or the
cart is navigated off-screen.

The Cart-Pole game is a popular introductory reinforcement learning problem because
it’s so simple. With a self-driving car, there are effectively an infinite number of possible
environmental states: As it moves along a road, its myriad sensors (cameras, radar, lidar, 2
accelerometers, microphones, and so on) stream in broad swaths of state information
from the world around the vehicle, on the order of a gigabyte of data per second. 3 The
Cart-Pole game, in stark contrast, has merely four pieces of state information:

1. The position of the cart along the one-dimensional horizontal axis 

2. The cart’s velocity

3. The angle of the pole

4. The pole’s angular velocity

Likewise, a number of fairly nuanced actions are possible with a self-driving car, such 
as accelerating, braking, and steering right or left. In the Cart-Pole game, at any given 
timestep t, exactly one action can be taken from only two possible actions: move left or 
move right. 
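To make the four-value state and the two-action choice concrete, here is a rough, self-contained simulation sketch. Everything numeric in it is an assumption for illustration (the physical constants, the failure thresholds, the starting range, and the simple Euler integration); none of it is taken from the text or from any particular library. The hypothetical lean_policy simply pushes the cart toward the side the pole is leaning.

```python
import math
import random

# Physical constants and thresholds are assumptions chosen for illustration;
# they are not taken from the text.
GRAVITY, CART_MASS, POLE_MASS = 9.8, 1.0, 0.1
POLE_HALF_LENGTH, FORCE, DT = 0.5, 10.0, 0.02
MAX_ANGLE = math.radians(12)   # episode ends if the pole tilts further
MAX_POSITION = 2.4             # episode ends if the cart travels further
MAX_TIMESTEPS = 200

def step(state, action):
    """Advance (position, velocity, angle, angular velocity) by one timestep."""
    x, x_dot, theta, theta_dot = state
    force = FORCE if action == "right" else -FORCE
    total_mass = CART_MASS + POLE_MASS
    pole_ml = POLE_MASS * POLE_HALF_LENGTH
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    temp = (force + pole_ml * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (GRAVITY * sin_t - cos_t * temp) / (
        POLE_HALF_LENGTH * (4.0 / 3.0 - POLE_MASS * cos_t ** 2 / total_mass))
    x_acc = temp - pole_ml * theta_acc * cos_t / total_mass
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)

def play_episode(policy, seed=0):
    """Play one episode and return the total reward (one point per timestep)."""
    rng = random.Random(seed)
    state = tuple(rng.uniform(-0.05, 0.05) for _ in range(4))
    total_reward = 0
    for _timestep in range(MAX_TIMESTEPS):
        state = step(state, policy(state))
        total_reward += 1
        position, angle = state[0], state[2]
        if abs(position) > MAX_POSITION or abs(angle) > MAX_ANGLE:
            break                      # the episode ends early
    return total_reward

def lean_policy(state):
    # Push the cart toward the side the pole is leaning (positive angle = right).
    return "right" if state[2] > 0 else "left"

print(play_episode(lean_policy))
```

Because the policy looks only at the current state tuple, it is also a small example of an agent that decides using nothing but the information available at timestep t.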

Markov Decision Processes 

Reinforcement learning problems can be defined mathematically as something called a
Markov decision process. MDPs feature the so-called Markov property: an assumption that
the current timestep contains all of the pertinent information about the state of the
environment from previous timesteps. With respect to the Cart-Pole game, this means that
our agent would elect to move right or left at a given timestep t by considering only
the attributes of the cart (e.g., its location) and the pole (e.g., its angle) at that particular
timestep t. 4

As summarized in Figure 13.3, the MDP is defined by five components:

1. S is the set of all possible states. Following set-theory convention, each
individual possible state (i.e., a particular combination of cart position, cart velocity, pole
angle, and angular velocity) is represented by the lowercase s. Even when we
consider the relatively simple Cart-Pole game, the number of possible recombinations
of its four state dimensions is enormous. To give a couple of coarse examples, the
cart could be moving slowly near the far-right of the screen with the pole balanced
vertically, or the cart could be moving rapidly toward the left edge of the screen
with the pole at a wide angle turning clockwise with pace.

2. A is the set of all possible actions. In the Cart-Pole game, this set contains only
two elements (left and right); other environments have many more. Each individual
possible action is denoted as a.


2. Same principle as radar, but uses lasers instead of radio waves. 

3. bit.ly/GBpersec 

4. The Markov property is assumed in many financial-trading strategies. As an example, a trading strategy might 
take into account the price of all the stocks listed on a given exchange at the end of a given trading day, while it 
does not consider the price of the stocks on any previous day. 
[Figure 13.3 diagram: the agent and environment loop (top) and a panel titled “Markov Decision Process” (bottom) listing S: all possible states; A: all possible actions; R: reward distribution given (s, a); P: transition probability to s_{t+1} given (s, a); γ: discount factor] 

Figure 13.3 The reinforcement learning loop (top; a rehashed version of Figure 4.3, 
provided again here for convenience) can be considered a Markov decision process, 
which is defined by the five components S, A, R, P, and γ (bottom). 


3. R is the distribution of reward given a state-action pair (some particular state paired 
with some particular action), denoted as (s, a). It’s a distribution in the sense of 
being a probability distribution: The exact same state-action pair (s, a) might randomly 
result in different amounts of reward r on different occasions. 5 The details of 
the reward distribution R (its shape, including its mean and variance) are hidden 
from the agent but can be glimpsed by taking actions within the environment. For 
example, in Figure 13.1, you can see that the cart is centered within the screen and 
the pole is angled slightly to the left. 6 We’d expect that pairing the action of moving 
left with this state s would, on average, correspond to a higher expected reward r 
relative to pairing the action of moving right with this state: Moving left in this state 
s should cause the pole to stand more upright, increasing the number of timesteps 
for which the pole is kept balanced, thereby tending to lead to a higher reward r. On 
the other hand, moving right in this state s would increase the probability that the 
pole would fall toward horizontal, thereby tending toward an early end to the game 
and a smaller reward r. 


5. Although this is true in reinforcement learning in general, the Cart-Pole game in particular is a relatively simple 
environment that is fully deterministic. In the Cart-Pole game, the exact same state-action pair (s, a) will in fact 
result in the same reward every time. For the purposes of illustrating the principles of reinforcement learning in 
general, we use examples in this section that imply the Cart-Pole game is less deterministic than it really is. 

6. For the sake of simplicity, let’s ignore cart velocity and pole angular velocity for this example, because we can’t 
infer these state aspects from this static image. 
4. P, like R, is also a probability distribution. In this case, it represents the probability 
of the next state (i.e., s_{t+1}) given a particular state-action pair (s, a) in the 
current timestep t. Like R, the P distribution is hidden from the agent, but again 
aspects of it can be inferred by taking actions within the environment. For example, 
in the Cart-Pole game, it would be relatively straightforward for the agent to 
learn that the left action corresponds directly to the cart moving leftward. 7 More-complex 
relationships (for example, that the left action in the state s captured 
in Figure 13.1 tends to correspond to a more vertically oriented pole in the next 
state) would be more difficult to learn and so would require more 
gameplay. 

5. γ (gamma) is a hyperparameter called the discount factor (also known as decay). To 
explain its significance, let’s move away from the Cart-Pole game for a moment and 
back to Pac-Man. The eponymous Pac-Man character explores a two-dimensional 
surface, gaining reward points for collecting fruit and dying if he gets caught by one 
of the ghosts that’s chasing him. As illustrated by Figure 13.4, when the agent considers 
the value of a prospective reward, it should value a reward that can be attained 
immediately (say, 100 points for acquiring cherries that are only one pixel’s distance 
away from Pac-Man) more highly than an equivalent reward that would 
require more timesteps to attain (100 points for cherries that are a distance of 20 
pixels away). Immediate reward is more valuable than some distant reward, because 
we can’t bank on the distant reward: A ghost or some other hazard could get in 
Pac-Man’s way. 8,9 If we were to set γ = 0.9, then cherries one timestep away 
would be considered to be worth 90 points, 10 whereas cherries 20 timesteps away 
would be considered to be worth only 12.2 points. 11 
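
The discounting arithmetic in the Pac-Man example can be checked with a few lines of Python. This is only a minimal sketch: the helper name discounted_value is our own illustrative choice, not something from the chapter’s notebook, and it assumes the reward is simply multiplied by γ once per timestep of delay. 

```python
def discounted_value(reward, gamma, timesteps_away):
    """Present value of a reward that lies `timesteps_away` steps in the future."""
    return reward * gamma ** timesteps_away


gamma = 0.9
near = discounted_value(100, gamma, 1)   # cherries one timestep away
far = discounted_value(100, gamma, 20)   # cherries 20 timesteps away
print(round(near, 2), round(far, 2))
```

Raising or lowering gamma in this sketch shows how quickly distant rewards lose value relative to immediate ones. 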


The Optimal Policy 

The ultimate objective with an MDP is to find a function that enables an agent to take 
an appropriate action a (from the set of all possible actions A) when it encounters any 
particular state s from the set of all possible environmental states S. In other words, we’d 


7. As with all of the other artificial neural networks in this book, the ANNs within deep reinforcement learning 
agents are initialized with random starting parameters. This means that, prior to any learning (via, say, playing 
episodes of the Cart-Pole game), the agent has no awareness of even the simplest relationships between some state-action 
pair (s, a) and the next state s_{t+1}. For example, although it may be intuitive and obvious to a human 
player of the Cart-Pole game that the action left should cause the cart to move leftward, nothing is “intuitive” or 
“obvious” to a randomly initialized neural net, and so all relationships must be learned through gameplay. 

8. The γ discount factor is analogous to the discounted cash flow calculations that are common in accounting: 
Prospective income a year from now is discounted relative to income expected today. 

9. Later in this chapter, we introduce concepts called value functions (V) and Q-value functions (Q). Both V and 
Q incorporate γ because it prevents them from becoming unbounded (and thus computationally impossible) in 
games with an infinite number of possible future timesteps. 

10. 100 × γ^t = 100 × 0.9^1 = 90 

11. 100 × γ^t = 100 × 0.9^20 = 12.16 
[Figure 13.4 illustration: 100 points 20 timesteps away, valued at 12.2; 100 points one timestep away, valued at 90] 

Figure 13.4 Based on the discount factor γ, in a Markov decision process more-distant 
reward is discounted relative to reward that’s more immediately attainable. Using the 
Atari game Pac-Man to illustrate this concept (a green trilobite sitting in for Mr. Pac-Man 
himself), with γ = 0.9, cherries (or a fish!) only one timestep away are valued at 90 
points, whereas cherries (a fish) 20 timesteps away are valued at 12.2 points. Like the 
ghosts in the Pac-Man game, the octopus here is roaming around and hoping to kill the 
poor trilobite. This is why immediately attainable rewards are more valuable than distant 
ones: There’s a higher chance of being killed before reaching the fish that’s farther away. 



Figure 13.5 The policy function π enables an agent to map any state s (from the set of 
all possible states S) to an action a from the set of all possible actions A. 


like our agent to learn a function that enables it to map S to A. As shown in Figure 13.5, 
such a function is denoted by π and we call it the policy function. 

The high-level idea of the policy function π, using vernacular language, is this: 
Regardless of the particular circumstance the agent finds itself in, what is the policy it 
should follow that will enable it to maximize its reward? For a more concrete definition of 
this reward-maximization idea, you are welcome to pore over this: 

$$J(\pi^*) = \max_{\pi} J(\pi) = \max_{\pi} \mathbb{E}\left[\sum_{t>0} \gamma^t r_t\right] \qquad (13.1)$$


In this equation: 

■ J(π) is called an objective function. This is a function that we can apply machine 
learning techniques to in order to maximize reward. 12 

■ π represents any policy function that maps S to A. 

■ π* represents a particular, optimal policy (out of all the potential π policies) for 
mapping S to A. That is, π* is a function that, fed any state s, will return an 
action a that will lead to the agent attaining the maximum possible discounted 
future reward. 

Expected discounted future reward is defined by $\mathbb{E}\left[\sum_{t>0} \gamma^t r_t\right]$, where E stands for 
expectation and $\sum_{t>0} \gamma^t r_t$ stands for the discounted future reward. 

To calculate the discounted future reward $\sum_{t>0} \gamma^t r_t$ over all future timesteps (i.e., 
t > 0), we do the following. 

■ Multiply the reward that can be attained in any given future timestep (r_t) by 
the discount factor of that timestep (γ^t). 

■ Accumulate these individual discounted future rewards (γ^t r_t) by summing 
them all up (using Σ). 
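
The two steps just described (multiply each future reward by γ^t, then sum) can be sketched in Python as follows. The function name discounted_return and the example reward sequence are hypothetical, chosen only for illustration; we also assume that the first future reward corresponds to t = 1, in keeping with the t > 0 convention above. 

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over future timesteps, starting at t = 1."""
    total = 0.0
    for t, r_t in enumerate(rewards, start=1):
        total += (gamma ** t) * r_t   # discount each reward, then accumulate
    return total


# Hypothetical sequence: a reward of 1 at each of the next three timesteps
rewards = [1.0, 1.0, 1.0]
g = discounted_return(rewards, gamma=0.9)
print(g)
```

Note that this computes the return for a single observed sequence of rewards; the expectation E in Equation 13.1 would average such returns over the many sequences that a policy could produce. 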


Essential Theory of Deep Q-Learning Networks 

In the preceding section, we defined reinforcement learning as a Markov decision process. 
At the end of the section, we indicated that as part of an MDP, we’d like our agent, 
when it encounters any given state s at any given timestep t, to follow some optimal 
policy π* that will enable it to select an action a that maximizes the discounted future 
reward it can obtain. The issue is that, even with a rather simple reinforcement learning 
problem like the Cart-Pole game, it is computationally intractable (or, at least, 
extremely computationally inefficient) to definitively calculate the maximum cumulative 
discounted future reward, $\max\left(\sum_{t>0} \gamma^t r_t\right)$. Because of all the possible future states 
S and all the possible actions A that could be taken in those future states, there are far 
too many possible future outcomes to take into consideration. Thus, as a computational 
shortcut, we’ll describe the Q-learning approach for estimating what the optimal action a 
in a given situation might be. 


12. The cost functions (a.k.a. loss functions) referred to throughout this book are examples of objective functions. 
Whereas cost functions return some cost value C, the objective function J(π) returns some reward value r. With 
cost functions, our objective is to minimize cost, so we apply gradient descent to them (as depicted by the valley-descending 
trilobite back in Figure 8.2). With the function J(π), in contrast, our objective is to maximize reward, 
and so we technically apply gradient ascent to it (conjuring up Figure 8.2 imagery, imagine a trilobite hiking to 
identify the peak of a mountain) even though the mathematics are the same as with gradient descent. 
Value Functions 

The story of Q-learning is most easily described by beginning with an explanation of 
value functions. The value function is defined by V^π(s). It provides us with an indication 
of how valuable a given state s is if our agent follows its policy π from that state onward. 

As a simple example, consider yet again the state s captured in Figure 13.1. 13 Assuming 
our agent already has some reasonably sensible policy π for balancing the pole, then 
the cumulative discounted future reward that we’d expect it to obtain in this state is probably 
fairly large because the pole is near vertical. The value V^π(s), then, of this particular 
state s is high. 

On the other hand, if we imagine a state s_h, where the pole angle is approaching horizontal, 
the value of it, V^π(s_h), is lower, because our agent has already lost control of 
the pole and so the episode is likely to terminate within the next few timesteps. 

Q-Value Functions 

The Q-value function 14 builds on the value function by taking into account not only 
state: It considers the utility of a particular action when that action is paired with a given 
state. That is, it rehashes our old friend, the state-action pair symbolized by (s, a). 

Thus, where the value function is defined by V^π(s), the Q-value function is defined by 
Q^π(s, a). 

Let’s return once more to Figure 13.1. Pairing the action left (let’s call this a_L) with 
this state s and then following a pole-balancing policy π from there should generally 
correspond to a high cumulative discounted future reward. Therefore, the Q-value of this 
state-action pair (s, a_L) is high. 

In comparison, let’s consider pairing the action right (we can call it a_R) with the state 
s from Figure 13.1 and then following a pole-balancing policy π from there. Although 
this might not turn out to be an egregious error, the cumulative discounted future reward 
would nevertheless probably be somewhat lower relative to taking the left action. In this 
state s, the left action should generally cause the pole to become more vertically oriented 
(enabling the pole to be better controlled and better balanced), whereas the rightward 
action should generally cause it to become somewhat more horizontally oriented and thus 
less controlled, with the episode somewhat more likely to end early. All in all, we would 
expect the Q-value of (s, a_L) to be higher than the Q-value of (s, a_R). 

Estimating an Optimal Q-Value 

When our agent confronts some state s, we would then like it to be able to calculate 
the optimal Q-value, denoted as Q*(s, a). We could consider all possible actions, and the 
action with the highest Q-value (the highest cumulative discounted future reward) 
would be the best choice. 

In the same way that it is computationally intractable to definitively calculate the 
optimal policy π* (Equation 13.1) even with relatively simple reinforcement learning 
problems, so too is it typically computationally intractable to definitively calculate an 


13. As we did earlier in this chapter, let’s consider cart position and pole position only, because we can’t speculate 
on cart velocity or pole angular velocity from this still image. 

14. The “Q” in Q-value stands for quality but you seldom hear practitioners calling these “quality-value functions.” 
optimal Q-value, Q*(s, a). With the approach of deep Q-learning (as introduced in 
Chapter 4; see Figure 4.5), however, we can leverage an artificial neural network to estimate 
what the optimal Q-value might be. These deep Q-learning networks (DQNs for 
short) rely on this equation: 

$$Q^*(s, a) \approx Q(s, a; \theta) \qquad (13.2)$$

In this equation: 

■ The optimal Q-value (Q*(s, a)) is being approximated. 

■ The Q-value approximation function incorporates neural network model parameters 
(denoted by the Greek letter theta, θ) in addition to its usual state s and action 
a inputs. These parameters are the usual artificial neuron weights and biases that we 
have become familiar with since Chapter 6. 

In the context of the Cart-Pole game, a DQN agent armed with Equation 13.2 can, 
upon encountering a particular state s, calculate whether pairing an action a ( left or right ) 
with this state corresponds to a higher predicted cumulative discounted future reward. If, 
say, left is predicted to be associated with a higher cumulative discounted future reward, 
then this is the action that should be taken. In the next section, we’ll code up a DQN 
agent that incorporates a Keras-built dense neural net to illustrate hands-on how this is 
done. 
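
Before turning to Keras, the decision rule in the preceding paragraph can be sketched with NumPy alone. Everything here is a stand-in: q_values is a hypothetical linear function playing the role of Q(s, a; θ), theta holds randomly drawn parameters rather than trained ones, and the state values are made up. The point is only to show that the agent compares one Q-value estimate per action and picks the largest. 

```python
import numpy as np


def q_values(state, theta):
    """Stand-in for Q(s, a; theta): returns one estimate per action."""
    return state @ theta


rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 2))            # hypothetical parameters: 4 state values, 2 actions

# Made-up state: cart position, cart velocity, pole angle, pole angular velocity
state = np.array([0.0, 0.1, -0.02, 0.3])

q = q_values(state, theta)
action = int(np.argmax(q))                 # index of the action with the highest estimate
print(q, action)
```

In the DQN agent defined in the next section, a dense neural network takes the place of this linear stand-in. 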



For a thorough introduction to the theory of reinforcement learning, including 
deep Q-learning networks, we recommend the recent edition of Richard Sutton 
(Figure 13.6) and Andrew Barto’s Reinforcement Learning: An Introduction, 15 which is 
available free of charge at bit.ly/SuttonBarto. 



Figure 13.6 The biggest star in the field of reinforcement learning, Richard Sutton has 
long been a computer science professor at the University of Alberta. He is more recently 
also a distinguished research scientist at Google DeepMind. 


15. Sutton, R., & Barto, A. (2018). Reinforcement Learning: An Introduction (2nd ed.). Cambridge, MA: MIT Press. 
Defining a DQN Agent 

Our code for defining a DQN agent that learns how to act in an environment (in this 
particular case, the Cart-Pole game from the OpenAI Gym library of 
environments) is provided within our Cartpole DQN Jupyter notebook. 16 Its dependencies 
are as follows: 

import random 
import gym 
import numpy as np 
from collections import deque 
from keras.models import Sequential 
from keras.layers import Dense 
from keras.optimizers import Adam 
import os 

The most significant new addition to the list is gym, the OpenAI Gym itself. As usual, we 
discuss each dependency in more detail as we apply it. 

The hyperparameters that we set at the top of the notebook are provided in 
Example 13.1. 

Example 13.1 Cart-Pole DQN hyperparameters 

env = gym.make('CartPole-v0') 
state_size = env.observation_space.shape[0] 
action_size = env.action_space.n 
batch_size = 32 
n_episodes = 1000 
output_dir = 'model_output/cartpole/' 
if not os.path.exists(output_dir): 
    os.makedirs(output_dir) 

Let’s look at this code line by line: 

■ We use the OpenAI Gym make() method to specify the particular environment 
that we’d like our agent to interact with. The environment we choose is version 
zero (v0) of the Cart-Pole game, and we assign it to the variable env. On your own 
time, you’re welcome to select an alternative OpenAI Gym environment, such as 
one of those presented in Figure 4.13. 

■ From the environment, we extract two parameters: 

1. state_size: the number of types of state information, which for the Cart-Pole 
game is 4 (recall that these are cart position, cart velocity, pole angle, 
and pole angular velocity). 

2. action_size: the number of possible actions, which for Cart-Pole is 2 (left 
and right). 


16. Our DQN agent is based directly on Keon Kim’s, which is available at his GitHub repository at 
bit.ly/keonDQN. 
■ We set our mini-batch size for training our neural net to 32. 

■ We set the number of episodes (rounds of the game) to 1000. As you’ll soon see, 
this is about the right number of episodes it will take for our agent to excel regularly 
at the Cart-Pole game. For more-complex environments, you’d likely need to 
increase this hyperparameter so that the agent has more rounds of gameplay to learn 
in. 

■ We define a unique directory name ('model_output/cartpole/') into which 
we’ll output our neural network’s parameters at regular intervals. If the directory 
doesn’t yet exist, we use os.makedirs() to make it. 

The rather large chunk of code for creating a DQN agent Python class, called 
DQNAgent, is provided in Example 13.2. 

Example 13.2 A deep Q-learning agent 

class DQNAgent: 

    def __init__(self, state_size, action_size): 
        self.state_size = state_size 
        self.action_size = action_size 
        self.memory = deque(maxlen=2000) 
        self.gamma = 0.95 
        self.epsilon = 1.0 
        self.epsilon_decay = 0.995 
        self.epsilon_min = 0.01 
        self.learning_rate = 0.001 
        self.model = self._build_model() 

    def _build_model(self): 
        model = Sequential() 
        model.add(Dense(32, activation='relu', 
                        input_dim=self.state_size)) 
        model.add(Dense(32, activation='relu')) 
        model.add(Dense(self.action_size, activation='linear')) 
        model.compile(loss='mse', 
                      optimizer=Adam(lr=self.learning_rate)) 
        return model 

    def remember(self, state, action, reward, next_state, done): 
        self.memory.append((state, action, 
                            reward, next_state, done)) 

    def train(self, batch_size): 
        minibatch = random.sample(self.memory, batch_size) 
        for state, action, reward, next_state, done in minibatch: 
            target = reward # if done 
            if not done: 
                target = (reward + 
                          self.gamma * 
                          np.amax(self.model.predict(next_state)[0])) 
            target_f = self.model.predict(state) 
            target_f[0][action] = target 
            self.model.fit(state, target_f, epochs=1, verbose=0) 
        if self.epsilon > self.epsilon_min: 
            self.epsilon *= self.epsilon_decay 

    def act(self, state): 
        if np.random.rand() <= self.epsilon: 
            return random.randrange(self.action_size) 
        act_values = self.model.predict(state) 
        return np.argmax(act_values[0]) 

    def save(self, name): 
        self.model.save_weights(name) 

    def load(self, name): 
        self.model.load_weights(name) 


Initialization Parameters 

We begin Example 13.2 by initializing the class with a number of parameters: 

■ state_size and action_size are environment-specific, but in the case of the 
Cart-Pole game are 4 and 2, respectively, as mentioned earlier. 

■ memory is for storing memories that can subsequently be replayed in order to train our 
DQN’s neural net. The memories are stored as elements of a data structure called 
a deque (pronounced “deck”), which is the same as a list except that, because we 
specified maxlen=2000, it only retains the 2,000 most recent memories. That is, 
whenever we attempt to append a 2,001st element onto the deque, its first element 
is removed, always leaving us with a list that contains no more than 2,000 elements. 

■ gamma is the discount factor (a.k.a. decay rate) γ that we introduced earlier in this 
chapter (see Figure 13.4). This agent hyperparameter discounts prospective rewards 
in future timesteps. Effective γ values typically approach 1 (for example, 0.9, 0.95, 
0.98, and 0.99). The closer to 1, the less we’re discounting future reward. 17 Tuning 
the hyperparameters of reinforcement learning models such as γ can be a fiddly 
process; near the end of this chapter, we discuss a tool called SLM Lab for carrying 
it out effectively. 


17. Indeed, if you were to set γ = 1 (which we don’t recommend) you wouldn’t be discounting future reward 
at all. 
■ epsilon, symbolized by the Greek letter ε, is another reinforcement learning 
hyperparameter called exploration rate. It represents the proportion of our agent’s 
actions that are random (enabling it to explore the impact of such actions on the 
next state s_{t+1} and the reward r returned by the environment) relative to how 
often we allow its actions to exploit the existing “knowledge” its neural net has 
accumulated through gameplay. Prior to having played any episodes, agents have 
no gameplay experience to exploit, so it is the most common practice to start them off 
exploring 100 percent of the time; this is why we set epsilon = 1.0. 

■ As the agent gains gameplay experience, we very slowly decay its exploration rate 
so that it can gradually exploit the information it has learned (hopefully enabling 
it to attain more reward, as illustrated in Figure 13.7). That is, at the end of each 



[Figure 13.7 illustration: a fish labeled “+100 reward”] 

Figure 13.7 As in Figure 13.4, here we use the Pac-Man environment (with a green 
trilobite representing a DQN agent in place of the Mr. Pac-Man character) to illustrate a 
reinforcement learning concept. In this case, the concept is exploratory versus 
exploitative actions. The higher the hyperparameter ε (epsilon) in a given episode, the 
more likely the agent is to be in its exploratory mode, in which it takes purely random 
actions: By chance, an agent in this mode might navigate in the opposite direction of a 
fish that would have provided an immediate reward of 100 points. The alternative to the 
exploratory mode is the exploitative mode. Assuming the DQN agent’s neural net 
parameters have already benefited from some previous gameplay experience, in its 
exploitative mode the agent’s policy should be to acquire reward that is immediately 
available to it. 
episode the agent plays, we multiply its ε by epsilon_decay. Common options for 
this hyperparameter are 0.990, 0.995, and 0.999. 18 

■ epsilon_min is a floor (a minimum) on how low the exploration rate ε can decay 
to. This hyperparameter is typically set to a near-zero value such as 0.001, 0.01, or 
0.02. We set it equal to 0.01, meaning that after ε has decayed to 0.01 (as it will 
in our case by the 911th episode), our agent will explore on only 1 percent of the 
actions it takes, exploiting its gameplay experience the other 99 percent of the 
time. 19 

■ learning_rate is the same stochastic gradient descent hyperparameter that we 
covered in Chapter 8. 

■ Finally, _build_model(), by the inclusion of its leading underscore, is being 
suggested as a private method. This means that this method is recommended for use 
“internally” only; that is, solely by instances of the class DQNAgent. 
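
To get a feel for how epsilon, epsilon_decay, and epsilon_min interact, the decay logic can be simulated on its own. The sketch below is an illustration under assumptions: the function name epsilon_schedule is ours, and we assume exactly one decay step per update, using the same guard that appears at the end of the train() method in Example 13.2. Because the exact episode at which the floor is reached depends on when training (and therefore decay) first begins, the counts from this simulation need not coincide with the episode number quoted above. 

```python
def epsilon_schedule(n_updates, epsilon=1.0, decay=0.995, floor=0.01):
    """Return the exploration rate in effect before each of n_updates decay steps."""
    history = []
    for _ in range(n_updates):
        history.append(epsilon)
        if epsilon > floor:      # same guard as in the agent's train() method
            epsilon *= decay
    return history


schedule = epsilon_schedule(1000)
print(schedule[0], schedule[100], schedule[-1])
```

Because the guard is checked before the multiplication, the final value can settle marginally below the floor rather than exactly on it. 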

Building the Agent’s Neural Network Model 

The _build_model() method of Example 13.2 is dedicated to constructing and compiling 
a Keras-specified neural network that maps an environment’s state s to the agent’s 
Q-value for each available action a. Once trained via gameplay, the agent will then be 
able to use the predicted Q-values to select the particular action it should take, given a 
particular environmental state it encounters. Within the method, there is nothing you 
haven’t seen before in this book: 

■ We specify a sequential model. 

■ We add to the model the following layers of neurons. 

■ The first hidden layer is dense, consisting of 32 ReLU neurons. Using the 
input_dim argument, we specify the shape of the network’s input layer, 
which is the dimensionality of the environment’s state information s. In the 
case of the Cart-Pole environment, this value is an array of length 4, with 
one element each for cart position, cart velocity, pole angle, and pole angular 
velocity. 20 

■ The second hidden layer is also dense, with 32 ReLU neurons. As mentioned 
earlier, we’ll explore hyperparameter selection (including how we home in 
on a particular model architecture) by discussing the SLM Lab tool later on 
in this chapter. 



18. Analogous to setting γ = 1, setting epsilon_decay = 1 would mean ε would not be decayed at all; that is, 
exploring at a continuous rate. This would be an unusual choice for this hyperparameter. 

19. If at this stage this exploration rate concept is somewhat unclear, it should become clearer as we examine our 
agent’s episode-by-episode results later on. 

20. In environments other than Cart-Pole, the state information might be much more complex. For example, 
with an Atari video game environment like Pac-Man, state s would consist of pixels on a screen, which would be 
a two- or three-dimensional input (for monochromatic or full-color, respectively). In a case such as this, a better 
choice of first hidden layer would be a convolutional layer such as Conv2D (see Chapter 10). 
■ The output layer has dimensionality corresponding to the number of possible 
actions. 21 In the case of the Cart-Pole game, this is an array of length 2, with 
one element for left and the other for right. As with a regression model (see 
Example 9.8), with DQNs the z values are output directly from the neural 
net instead of being converted into a probability between 0 and 1. To do this, 
we specify the linear activation function instead of the sigmoid or softmax 
functions that have otherwise dominated this book. 

■ As indicated when we compiled our regression model (Example 9.9), mean squared 
error is an appropriate choice of cost function when we use linear activation in the 
output layer, so we set the compile() method’s loss argument to mse. We return 
to our routine optimizer choice, Adam. 
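
To make the shapes in this architecture concrete, here is a hand-written NumPy forward pass through a network of the same layout (4 inputs, two hidden layers of 32 ReLU neurons, 2 linear outputs). It is a sketch for illustration only and is not how Keras implements Dense layers internally: the weights are random rather than trained, and the helper names relu and forward are our own. 

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def forward(state, params):
    """Dense(32, relu) -> Dense(32, relu) -> Dense(2, linear), written out by hand."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h1 = relu(state @ w1 + b1)
    h2 = relu(h1 @ w2 + b2)
    return h2 @ w3 + b3          # linear output: raw, unsquashed Q-value estimates


rng = np.random.default_rng(42)
sizes = [4, 32, 32, 2]           # state_size, two hidden layers, action_size
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

state = rng.normal(size=(1, 4))  # a batch containing one made-up Cart-Pole state
q = forward(state, params)
print(q.shape)
```

The leading dimension of 1 is the batch dimension, which is why the agent’s code indexes into the prediction with [0] to retrieve the Q-values for a single state. 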

Remembering Gameplay 

At any given timestep t (that is, during any given iteration of the reinforcement learning 
loop; refer back to Figure 13.3) the DQN agent’s remember() method is run in order 
to append a memory to the end of its memory deque. Each memory in this deque consists 
of five pieces of information about timestep t: 

1. The state s_t that the agent encountered 

2. The action a_t that the agent took 

3. The reward r_t that the environment returned to the agent 

4. The next_state s_{t+1} that the environment also returned to the agent 

5. A Boolean flag done that is True if timestep t was the final iteration of the episode, 
and False otherwise 
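
The behavior of a bounded deque holding five-element memories can be seen in a small standalone sketch. The maxlen of 3 and the placeholder strings standing in for states are assumptions made purely so that the eviction of old memories is easy to observe; the agent itself uses maxlen=2000 and NumPy arrays for states. 

```python
from collections import deque

memory = deque(maxlen=3)         # small maxlen so the eviction is easy to see

for t in range(5):
    # (state, action, reward, next_state, done), with placeholder values
    memory.append((f"s{t}", 0, 1.0, f"s{t + 1}", t == 4))

oldest_state = memory[0][0]
newest_state = memory[-1][0]
print(len(memory), oldest_state, newest_state)
```

Once the deque is full, each append silently discards the oldest memory, so no explicit bookkeeping is needed in remember(). 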

Training via Memory Replay 

The DQN agent’s neural net model is trained by replaying memories of gameplay, as shown 
within the train() method of Example 13.2. The process begins by randomly sampling 
a minibatch of 32 (as per the agent’s batch_size parameter) memories from the memory 
deque (which holds up to 2,000 memories). Sampling a small subset of memories from 
a much larger set of the agent’s experiences makes model-training more efficient: If we 
were instead to use, say, the 32 most recent memories to train our model, many of the 
states across those memories would be very similar. To illustrate this point, consider a 
timestep t where the cart is at some particular location and the pole is near vertical. The 
adjacent timesteps (e.g., t − 1, t + 1, t + 2) are also likely to be at nearly the same location 
with the pole in a near-vertical orientation. By sampling from across a broad range of 
memories instead of temporally proximal ones, the model will be provided with a richer 
cornucopia of experiences to learn from during each round of training. 
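
The contrast between recent and randomly sampled memories can be illustrated with a toy deque. In this sketch, integers stand in for memories (a larger integer meaning a more recent timestep), which is an assumption made only to keep the example small; the seed is fixed so that the run is repeatable. 

```python
import random
from collections import deque

# Integers stand in for memories; a larger value means a more recent timestep
memory = deque(range(2000), maxlen=2000)

random.seed(0)
recent = list(memory)[-32:]                 # the 32 most recent memories
sampled = random.sample(list(memory), 32)   # 32 memories drawn from anywhere in the deque

recent_span = max(recent) - min(recent)
sampled_span = max(sampled) - min(sampled)
print(recent_span, sampled_span)
```

The most recent memories all fall within a window of consecutive timesteps, whereas the random sample is spread across a much wider stretch of gameplay. 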

For each of the 32 sampled memories, we carry out a round of model training as 
follows: If done is True (that is, if the memory was of the final timestep of an episode), 
then we know definitively that the highest possible reward that could be attained from 



21. Any previous models in this book with only two outcomes (as in Chapters 11 and 12) used a single sigmoid 
neuron. Here, we specify separate neurons for each of the outcomes, because we would like our code to generalize 
beyond the Cart-Pole game. While Cart-Pole has only two actions, many environments have more than two. 
this timestep is equal to the reward r_t. Thus, we can just set our target reward equal to 
reward. 

Otherwise (i.e., if done is False) then we try to estimate what the target reward 
(the maximum discounted future reward) might be. We perform this estimation by 
starting with the known reward r_t and adding to it the discounted 22 maximum future 
Q-value. Possible future Q-values are estimated by passing the next (i.e., future) state 
s_{t+1} into the model’s predict() method. Doing this in the context of the Cart-Pole 
game returns two outputs: one output for the action left and the other for the action right. 
Whichever of these two outputs is higher (as determined by the NumPy amax function) is 
the maximum predicted future Q-value. 

Whether target is known definitively (because the timestep was the final one in an 
episode) or it’s estimated using the maximum future Q-value calculation, we continue 
onward within the train() method’s for loop: 

■ We run the predi ct () method again, passing in the current state s t . As before, in 
the context of the Cart-Pole game this returns two outputs: one for the left action 
and one for the right. We store these two outputs in the variable target_f. 

■ Whichever acti on at the agent actually took in this memory, we use 
target_f [0] [action] = target to replace that target_f output with the 
target reward. 23 

■ We train our model by calling the f i t () method. 

■ The model input is the current state s t and its output is target_f, which 
incorporates our approximation of the maximum future discounted reward. 
By tuning the models parameters (represented by 6 in Equation 13.2), we 
thus improve its capacity to accurately predict the action that is more likely to 
be associated with maximizing future reward in any given state. 

■ In many reinforcement learning problems, epochs can be set to 1 . Instead of 
recycling an existing training dataset multiple times, we can cheaply engage in 
more episodes of the Cart-Pole game (for example) to generate as many fresh 
training data as we fancy. 

■ We set verbose=0 because we don’t need any model-fitting outputs at this 
stage to monitor the progress of model training. As we demonstrate shortly, 
we’ll instead monitor agent performance on an episode-by-episode basis.
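The target construction described in the steps above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the agent's actual train() method: the `predict` function is a stand-in for the Keras model's predict() method, the helper names `q_target` and `replay_targets` are hypothetical, and the weight matrix and default discount factor are made up for illustration.

```python
import numpy as np

def q_target(reward, next_state, done, predict, gamma=0.95):
    """Target for one remembered timestep (hypothetical helper)."""
    if done:
        # Final timestep of an episode: the target is the known reward.
        return reward
    # Otherwise: known reward plus the discounted maximum future Q-value.
    return reward + gamma * np.amax(predict(next_state)[0])

def replay_targets(minibatch, predict, gamma=0.95):
    """Build (inputs, targets) arrays from remembered timesteps."""
    inputs, targets = [], []
    for state, action, reward, next_state, done in minibatch:
        target = q_target(reward, next_state, done, predict, gamma)
        target_f = np.array(predict(state), dtype=float)  # one output per action
        target_f[0][action] = target  # overwrite only the action actually taken
        inputs.append(state[0])
        targets.append(target_f[0])
    return np.array(inputs), np.array(targets)

# Stand-in for model.predict(): a fixed linear map from 4 state values to 2 actions.
W = np.array([[0.1, 0.2], [0.0, 0.1], [0.3, 0.0], [0.2, 0.4]])
predict = lambda s: s @ W

state = np.ones((1, 4))
next_state = np.full((1, 4), 2.0)
minibatch = [(state, 1, 1.0, next_state, False),   # mid-episode memory
             (state, 0, -10.0, next_state, True)]  # episode-ending memory
X, y = replay_targets(minibatch, predict, gamma=0.5)
```

In the real agent, arrays like `X` and `y` would be what the fit() call receives; here they only demonstrate that a single output per memory is overwritten while the other keeps the model's own prediction.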

Selecting an Action to Take 

To select a particular action a_t to take at a given timestep t, we use the agent’s act()
method. Within this method, the NumPy rand function is used to sample a random
value between 0 and 1 that we’ll call v. In conjunction with our agent’s epsilon,


22. That is, multiplied by gamma, the discount factor γ.

23. We do this because we can only train the Q-value estimate based on actions that were actually taken by the
agent: We estimated target based on next_state s_{t+1} and we only know what s_{t+1} was for the action a_t
that was actually taken by the agent at timestep t. We don’t know what next state s_{t+1} the environment might
have returned had the agent taken a different action than it actually took.
epsilon_decay, and epsilon_min hyperparameters, this v value will determine for
us whether the agent takes an exploratory action or an exploitative one: 24

■ If the random value v is less than or equal to the exploration rate ε, then a random
exploratory action is selected using the randrange function. In early episodes,
when ε is high, most of the actions will be exploratory. In later episodes, as ε decays
further and further (according to the epsilon_decay hyperparameter), the
agent will take fewer and fewer exploratory actions.

■ Otherwise (that is, if the random value v is greater than ε), the agent selects an
action that exploits the “knowledge” the model has learned via memory replay.

To exploit this knowledge, the state s_t is passed in to the model’s predict()
method, which returns an activation output 25 for each of the possible actions the
agent could theoretically take. We use the NumPy argmax function to select the
action a_t associated with the largest activation output.
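The explore-or-exploit decision just described might look roughly like the following. It is a sketch rather than the agent's actual act() method: `select_action` is a hypothetical standalone function, and `predict` is a stand-in for the Keras model's predict() method that returns fixed, made-up outputs.

```python
import random
import numpy as np

def select_action(state, predict, epsilon, action_size):
    """Epsilon-greedy choice between exploring and exploiting."""
    v = np.random.rand()  # random value between 0 and 1
    if v <= epsilon:
        # Explore: pick any action at random.
        return random.randrange(action_size)
    # Exploit: pick the action with the largest predicted output.
    act_values = predict(state)
    return int(np.argmax(act_values[0]))

# Stand-in for model.predict(): always rates action 1 (right) higher.
predict = lambda s: np.array([[0.2, 0.9]])
state = np.zeros((1, 4))

# With no exploration, the highest-valued action is chosen.
greedy = select_action(state, predict, epsilon=0.0, action_size=2)
# With full exploration, actions are drawn at random from the action space.
explored = [select_action(state, predict, epsilon=1.0, action_size=2)
            for _ in range(20)]
```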

Saving and Loading Model Parameters

Finally, the save() and load() methods are one-liners that enable us to save and load
the parameters of the model. Particularly with respect to complex environments, agent 
performance can be flaky: For long stretches, the agent may perform very well in a 
given environment, and then later appear to lose its capabilities entirely. Because of this 
flakiness, it’s wise to save our model parameters at regular intervals. Then, if the agent’s 
performance drops off in later episodes, the higher-performing parameters from some 
earlier episode can be loaded back up. 

Interacting with an OpenAI Gym Environment 

Having created our DQN agent class, we can initialize an instance of the class, which we
name agent, with this line of code:

agent = DQNAgent(state_size, action_size) 

The code in Example 13.3 enables our agent to interact with an OpenAI Gym
environment, which in our particular case is the Cart-Pole game.

Example 13.3 DQN agent interacting with an OpenAI Gym environment 

for e in range(n_episodes):

    state = env.reset()
    state = np.reshape(state, [1, state_size])

    done = False
    time = 0
    while not done:
        # env.render()
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        reward = reward if not done else -10
        next_state = np.reshape(next_state, [1, state_size])
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        if done:
            print("episode: {}/{}, score: {}, e: {:.2}"
                  .format(e, n_episodes-1, time, agent.epsilon))
        time += 1
    if len(agent.memory) > batch_size:
        agent.train(batch_size)
    if e % 50 == 0:
        agent.save(output_dir + "weights_"
                   + '{:04d}'.format(e) + ".hdf5")


24. We introduced the exploratory and exploitative modes of action when discussing the initialization parameters
for our DQNAgent class earlier, and they’re illustrated playfully in Figure 13.7.

25. Recall that the activation is linear, and thus the output is not a probability; instead, it is the discounted future
reward for that action.

Recalling that we had set the hyperparameter n_episodes to 1000, Example 13.3
consists of a big for loop that allows our agent to engage in these 1,000 rounds of gameplay.
Each episode of gameplay is counted by the variable e and involves:

■ We use env.reset() to begin the episode with a random state s_t. For the purposes
of passing state into our Keras neural network in the orientation the model
is expecting, we use reshape to convert it from a column into a row. 26

■ Nested within our thousand-episode loop is a while loop that iterates over the
timesteps of a given episode. Until the episode ends (i.e., until done equals True),
in each timestep t (represented by the variable time), we do the following.

■ The env.render() line is commented out because if you are running this
code via a Jupyter notebook within a Docker container, this line will cause
an error. If, however, you happen to be running the code via some other
means (e.g., in a Jupyter notebook without using Docker) then you can try
uncommenting this line. If an error isn’t thrown, then a pop-up window
should appear that renders the environment graphically. This enables you to
watch your DQN agent as it plays the Cart-Pole game in real time, episode
by episode. It’s fun to watch, but it’s by no means essential: It certainly has no
impact on how the agent learns!

■ We pass the state s_t into the agent’s act() method, and this returns the
agent’s action a_t, which is either 0 (representing left) or 1 (right).


26. We previously performed this transposition for the same reason back in Example 9.11. 
■ The action a_t is provided to the environment’s step() method, which
returns the next_state s_{t+1}, the current reward r_t, and an update to the
Boolean flag done.

■ If the episode is done (i.e., done equals True), then we set reward to a negative
value (-10). This provides a strong disincentive to the agent to end an
episode early by losing control of balancing the pole or navigating off the
screen. If the episode is not done (i.e., done is False), then reward is +1 for
each additional timestep of gameplay.

■ In the same way that we needed to reorient state to be a row at the start of
the episode, we use reshape to reorient next_state to a row here. 

■ We use our agent’s remember() method to save all the aspects of this timestep
(the state s_t, the action a_t that was taken, the reward r_t, the next state s_{t+1},
and the flag done) to memory.

■ We set state equal to next_state in preparation for the next iteration of the 
loop, which will be timestep t + 1. 

■ If the episode ends, then we print summary metrics on the episode (see 
Figures 13.8 and 13.9 for example outputs). 

■ Add 1 to our timestep counter time. 

■ If the length of the agent’s memory deque is larger than our batch size, then we
use the agent’s train() method to train its neural net parameters by replaying its
memories of gameplay. 27

■ Every 50 episodes, we use the agent’s save() method to store the neural net
model’s parameters.
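Two of the per-timestep details in the walkthrough above, the reshaping of a state into a row and the replacement of the final reward with -10, can be tried in isolation. The sketch below assumes a hard-coded observation array in place of what env.reset() or env.step() would return, and `shape_reward` is a hypothetical helper name.

```python
import numpy as np

state_size = 4  # Cart-Pole observations have four values

# Stand-in for the observation returned by env.reset() or env.step().
observation = np.array([0.01, -0.02, 0.03, 0.04])
state = np.reshape(observation, [1, state_size])  # shape (1, 4): a single row

def shape_reward(reward, done):
    """Replace the reward with -10 on the timestep that ends the episode."""
    return reward if not done else -10
```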

As shown in Figure 13.8, during our agent’s first 10 episodes of the Cart-Pole game,
the scores were low. It didn’t manage to keep the game going for more than 42 timesteps
(i.e., a score of 41). During these initial episodes, the exploration rate ε began at 100
percent. By the 10th episode, ε had decayed to 96 percent, meaning that the agent was
in exploitative mode (refer back to Figure 13.7) on about 4 percent of timesteps. At this
early stage of training, however, most of these exploitative actions would probably have 
been effectively random anyway. 

As shown in Figure 13.9, by the 991st episode our agent had mastered the Cart-Pole
game. It attained a perfect score of 199 in all of the final 10 episodes by keeping the game
going for 200 timesteps in each one. By the 911th episode, 28 the exploration rate ε had
reached its minimum of 1 percent, so during all of these final episodes the agent is in
exploitative mode in about 99 percent of timesteps. From the perfect performance in
these final episodes, it’s clear that these exploitative actions were guided by a neural net
well trained by its gameplay experience from previous episodes.
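To get a feel for how the exploration rate falls from 100 percent toward its 1 percent floor, the decay can be simulated on its own. The sketch below assumes a multiplicative decay factor of 0.995 applied once per training call; that value is an assumption (it is not stated in this section), and `decay_epsilon` is a hypothetical helper.

```python
def decay_epsilon(epsilon, epsilon_decay=0.995, epsilon_min=0.01):
    """Apply one multiplicative decay step while above the minimum."""
    if epsilon > epsilon_min:
        epsilon *= epsilon_decay
    return epsilon

# Exploration rate after nine decay steps.
epsilon = 1.0
for _ in range(9):
    epsilon = decay_epsilon(epsilon)
after_nine = epsilon

# Count how many decay steps it takes to reach the minimum.
epsilon, steps = 1.0, 0
while epsilon > 0.01:
    epsilon = decay_epsilon(epsilon)
    steps += 1
```

Under that assumed factor, `after_nine` comes out at roughly 0.96 and `steps` at a little over 900, broadly in line with the figures quoted above. The exact episode at which the minimum is reached also depends on when training first begins.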


27. You can optionally move this training step up so that it’s inside the while loop. Each episode will take a lot
longer because you’ll be training the agent much more often, but your agent will tend to solve the Cart-Pole game 
in far fewer episodes. 

28. Not shown here, but can be seen in our Cartpole DQN Jupyter notebook. 
Figure 13.8 The performance of our DQN agent during its first 10 episodes playing the
Cart-Pole game. Its scores are low (keeping the game going for between 10 and 42
timesteps), and its exploration rate ε is high (starting at 100 percent and decaying to 96
percent by the 10th episode).
Figure 13.9 The performance of our DQN agent during its final 10 episodes playing
the Cart-Pole game. It scores the maximum (199 timesteps) across all 10 episodes. The
exploration rate ε had already decayed to its minimum of 1 percent, so the agent is in
exploitative mode for ~99 percent of its actions.


As mentioned earlier in this chapter, deep reinforcement learning agents often dis- 
play finicky behavior. When you train your DQN agent to play the Cart-Pole game, 
you might find that it performs very well during some later episodes (attaining many 
consecutive 200-timestep episodes around, say, the 850th or 900th episode) but 
then it performs poorly around the final (1,000th) episode. If this ends up being the 
case, you can use the load() method to restore model parameters from an earlier,
higher-performing phase. 
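One way to decide which saved parameters to restore is to compare the scores recorded around each save point. The following is a sketch under stated assumptions: the scores are made up, the helper names `best_checkpoint` and `weights_filename` are hypothetical, the `model_output/cartpole/` directory is an illustrative value for output_dir, and averaging the ten preceding episodes is just one reasonable selection rule.

```python
def best_checkpoint(scores, save_every=50, window=10):
    """Return the saved episode whose preceding `window` scores average highest."""
    best_episode, best_avg = None, float("-inf")
    for episode in range(0, len(scores), save_every):
        recent = scores[max(0, episode - window + 1): episode + 1]
        avg = sum(recent) / len(recent)
        if avg > best_avg:
            best_episode, best_avg = episode, avg
    return best_episode

def weights_filename(output_dir, episode):
    """Mirror the naming pattern used when saving in Example 13.3."""
    return output_dir + "weights_" + "{:04d}".format(episode) + ".hdf5"

# Made-up scores: strong around episode 850, weaker by the end.
scores = [20] * 800 + [199] * 100 + [60] * 100
episode = best_checkpoint(scores)
path = weights_filename("model_output/cartpole/", episode)
# The chosen path could then be passed to agent.load(path).
```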


Hyperparameter Optimization with SLM Lab 

At a number of points in this chapter, in one breath we’d introduce a hyperparameter and 
then in the next breath we’d indicate that we’d later introduce a tool called SLM Lab for 
tuning that hyperparameter. 29 Well, that moment has arrived! 


29. “SLM” is an abbreviation of strange loop machine, with the strange loop concept being related to ideas about the
experience of human consciousness. See Hofstadter, R. (1979). Gödel, Escher, Bach. New York: Basic Books.
SLM Lab is a deep reinforcement learning framework developed by Wah Loon Keng
and Laura Graesser, who are California-based software engineers (at the mobile-gaming
firm MZ and within the Google Brain team, respectively). The framework is available at
github.com/kengz/SLM-Lab and has a broad range of implementations and functionality
related to deep reinforcement learning:

■ It enables the use of many types of deep reinforcement learning agents, including 
DQN and others (forthcoming in this chapter). 

■ It provides modular agent components, allowing you to dream up your own novel 
categories of deep RL agents. 

■ You can straightforwardly drop agents into environments from a number of different
environment libraries, such as OpenAI Gym and Unity (see Chapter 4).

■ Agents can be trained in multiple environments simultaneously. For example, a 
single DQN agent can at the same time solve the OpenAI Gym Cart-Pole game 
and the Unity ball-balancing game Ball2D. 

■ You can benchmark your agent’s performance in a given environment against
others’ efforts. 

Critically, for our purposes, the SLM Lab also provides a painless way to experiment
with various agent hyperparameters to assess their impact on an agent’s performance in a
given environment. Consider, for example, the experiment graph shown in Figure 13.10. In
this particular experiment, a DQN agent was trained to play the Cart-Pole game during
a number of distinct trials. Each trial is an instance of an agent with particular, distinct
hyperparameters trained for many episodes. Some of the hyperparameters varied between
trials were as follows.
Figure 13.10 An experiment run with SLM Lab, investigating the impact of various
hyperparameters (e.g., hidden-layer architecture, activation function, learning rate) on the
performance of a DQN agent within the Cart-Pole environment
■ Dense net model architecture

■ [32]: a single hidden layer, with 32 neurons

■ [64]: also a single hidden layer, this time with 64 neurons

■ [32, 16]: two hidden layers; the first with 32 neurons and the second
with 16

■ [64, 32]: also with two hidden layers, this time with 64 neurons in the first
hidden layer and 32 in the second

■ Activation function across all hidden layers

■ Sigmoid

■ Tanh

■ ReLU

■ Optimizer learning rate (η), which ranged from zero up to 0.2

■ Exploration rate (ε) annealing, which ranged from 0 to 100 30
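To make the idea of distinct trials concrete, the hyperparameters listed above can be enumerated as a grid in plain Python. This is only an illustrative sketch: it does not use SLM Lab's own experiment specification format, the dictionary keys are hypothetical names, and the specific learning-rate and annealing values are arbitrary picks from within the stated ranges.

```python
from itertools import product

# Hypothetical search space mirroring the list above; the learning rates and
# annealing values are illustrative choices within the stated ranges.
search_space = {
    "hidden_layers": [[32], [64], [32, 16], [64, 32]],
    "activation": ["sigmoid", "tanh", "relu"],
    "learning_rate": [0.002, 0.02, 0.2],
    "epsilon_anneal_episodes": [10, 50, 100],
}

# Each trial is one combination of hyperparameter values.
trials = [dict(zip(search_space, values))
          for values in product(*search_space.values())]
```

With four architectures, three activation functions, three learning rates, and three annealing settings, this grid holds 108 trials, which shows how quickly an exhaustive search grows.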

SLM Lab provides a number of metrics for evaluating model performance (some of
which can be seen along the vertical axis of Figure 13.10): 

■ Strength: This is a measure of the cumulative reward attained by the agent.

■ Speed: This is how quickly (i.e., over how many episodes) the agent was able to
reach its strength.

■ Stability: After the agent solved how to perform well in the environment, this is a
measure of how well it retained its solution over subsequent episodes.

■ Consistency: This is a metric of how reproducible the performance of the agent was
across trials that had identical hyperparameter settings.

■ Fitness: An overall summary metric that takes into account the above four 
metrics simultaneously. Using the fitness metric in the experiment captured by 
Figure 13.10, it appears that the following hyperparameter settings are optimal for 
this DQN agent playing the Cart-Pole game: 

■ A single-hidden-layer neural net architecture, with 64 neurons in that single 
layer outperforming the 32-neuron model. 

■ The tanh activation function for the hidden layer neurons. 

■ A low learning rate (η) of ~0.02.

■ Trials with an exploration rate (ε) that anneals over 10 episodes outperform
trials that anneal over 50 or 100 episodes.


30. Annealing is an alternative to ε decay that serves the same purpose. With the epsilon and epsilon_min
hyperparameters set to fixed values (say, 1.0 and 0.01, respectively), variations in annealing will adjust epsilon_decay
such that an ε of 0.01 will be reached by a specified episode. If, for example, annealing is set to 25 then ε will
decay at a rate such that it lowers uniformly from 1.0 in the first episode to 0.01 after 25 episodes. If annealing
is set to 50 then ε will decay at a rate such that it lowers uniformly from 1.0 in the first episode to 0.01 after 50
episodes.
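The relationship between an annealing setting and epsilon_decay can be worked out directly. The sketch below assumes that decay is multiplicative and applied once per episode, so that the factor d satisfies epsilon_start × d^n = epsilon_min after n episodes; this is one reading of the footnote above, not a description of how SLM Lab implements annealing, and `decay_for_annealing` is a hypothetical helper.

```python
def decay_for_annealing(anneal_episodes, epsilon_start=1.0, epsilon_min=0.01):
    """Per-episode multiplicative factor that reaches epsilon_min on schedule."""
    return (epsilon_min / epsilon_start) ** (1.0 / anneal_episodes)

decay_25 = decay_for_annealing(25)  # faster annealing: smaller factor
decay_50 = decay_for_annealing(50)  # slower annealing: factor closer to 1

# Check: applying the factor once per episode lands on the minimum.
epsilon = 1.0
for _ in range(25):
    epsilon *= decay_25
```

If "lowers uniformly" is instead taken to mean a linear schedule, the per-episode change would be a fixed subtraction of (1.0 - 0.01) / n rather than a multiplication.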
Details of running SLM Lab are beyond the scope of our book, but the library is well
documented at kengz.gitbooks.io/slm-lab. 31

Agents Beyond DQN 

In the world of deep reinforcement learning, deep Q-learning networks like the one we 
built in this chapter are relatively simple. To their credit, not only are DQNs (compar- 
atively) simple, but—relative to many other deep RL agents—they also make efficient 
use of the training samples that are available to them. That said, DQN agents do have 
drawbacks. Most notable are: 

1. If the possible number of state-action pairs is large in a given environment, then the 
Q-function can become extremely complicated, and so it becomes intractable to
estimate the optimal Q-value, Q*. 

2. Even in situations where finding Q* is computationally tractable, DQNs are not 
great at exploring relative to some other approaches, and so a DQN may not converge
on Q* anyway.

Thus, even though DQNs are sample efficient, they aren’t applicable to solving all 
problems. 

To wrap up this deep reinforcement learning chapter, let’s briefly introduce the
types of agents beyond DQNs. The main categories of deep RL agents, as shown in 
Figure 13.11, are: 

■ Value optimization: These include DQN agents and their derivatives (e.g., double 
DQN, dueling QN) as well as other types of agents that solve reinforcement learning 
problems by optimizing value functions (including Q-value functions). 


Figure 13.11 The broad categories of deep reinforcement learning agents

[Diagram: Deep RL Agents, with combined approaches such as Actor-Critic (AC) and AlphaGo]


31. At the time of this writing, SLM Lab installation is straightforward only on Unix-based systems, including
macOS. 
■ Imitation learning: The agents in this category (e.g., behavioral cloning and conditional
imitation learning algorithms) are designed to mimic behaviors that are taught to
them through demonstration, for example by showing them how to place
dinner plates on a dish rack or how to pour water into a cup. Although imitation
learning is a fascinating approach, its range of applications is relatively small and we
don’t discuss it further in this book.

■ Model optimization: Agents in this category learn to predict future states based on
(s, a) at a given timestep. An example of one such algorithm is Monte Carlo tree 
search (MCTS), which we introduced with respect to AlphaGo in Chapter 4. 

■ Policy optimization: Agents in this category learn policies directly, that is, they directly
learn the policy function π shown in Figure 13.5. We’ll cover these in further detail
in the next section.

Policy Gradients and the REINFORCE Algorithm 

Recall from Figure 13.5 that the purpose of a reinforcement learning agent is to learn
some policy function π that maps the state space S to the action space A. With DQNs,
and indeed with any other value optimization agent, π is learned indirectly by estimating
a value function such as the optimal Q-value, Q*. With policy optimization agents, π is
learned directly instead.

Policy gradient (PG) algorithms, which can perform gradient ascent 32 on π directly,
are exemplified by a particularly well-known reinforcement learning algorithm called
REINFORCE. 33 The advantage of PG algorithms like REINFORCE is that they are
likely to converge on a fairly optimal solution, 34 so they’re more widely applicable than
value optimization algorithms like DQN. The trade-off is that PGs have low consistency.
That is, they have higher variance in their performance relative to value optimization 
approaches like DQN, and so PGs tend to require a larger number of training samples. 
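To illustrate what performing gradient ascent on π directly means, here is a minimal REINFORCE-style sketch on a toy problem. Everything about the setup is an assumption made for illustration: the environment is a two-action, one-step game with fixed rewards, the policy is a softmax over two parameters, and no baseline or neural network is involved. The update multiplies the return by the gradient of the log-probability of the sampled action.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

rng = np.random.default_rng(0)
theta = np.zeros(2)             # policy parameters, one per action
rewards = np.array([0.0, 1.0])  # toy environment: action 1 is always better
learning_rate = 0.1

for _ in range(2000):
    probs = softmax(theta)           # the policy pi(a)
    action = rng.choice(2, p=probs)  # sample an action from the policy
    G = rewards[action]              # return of this one-step episode
    grad_log_pi = -probs             # gradient of log pi(action) w.r.t. theta
    grad_log_pi[action] += 1.0
    theta += learning_rate * G * grad_log_pi  # gradient ascent on expected return

final_probs = softmax(theta)
```

After training, nearly all of the probability mass in `final_probs` sits on the better action. With noisy rewards the same update still works but fluctuates more from run to run, which is the variance issue mentioned above.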

The Actor-Critic Algorithm 

As suggested by Figure 13.11, the actor-critic algorithm is an RL agent that combines the 
value optimization and policy optimization approaches. More specifically, as depicted in 
Figure 13.12, the actor-critic combines the Q-learning and PG algorithms. At a high 
level, the resulting algorithm involves a loop that alternates between: 

■ Actor: a PG algorithm that decides on an action to take. 

■ Critic: a Q-learning algorithm that critiques the action that the actor selected, 
providing feedback on how to adjust. It can take advantage of efficiency tricks in 
Q-learning, such as memory replay. 
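The alternation between actor and critic can be sketched on a toy problem as well. The assumptions here are substantial and deliberate: the environment is a two-action, one-step game with noisy rewards, the actor is a softmax policy over two parameters, the critic is a two-entry table of action-value estimates (for one-step episodes the Q-learning update reduces to moving the estimate toward the observed reward), and memory replay is omitted.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

rng = np.random.default_rng(0)
theta = np.zeros(2)                  # actor: policy parameters
q = np.zeros(2)                      # critic: estimated value of each action
mean_rewards = np.array([0.2, 0.8])  # toy environment, unknown to the agent
actor_lr, critic_lr = 0.1, 0.1

for _ in range(3000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)                # actor picks an action
    reward = mean_rewards[action] + rng.normal(0.0, 0.1)
    q[action] += critic_lr * (reward - q[action])  # critic updates its estimate
    advantage = q[action] - probs @ q              # critic's feedback to the actor
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += actor_lr * advantage * grad_log_pi    # actor adjusts its policy

final_probs = softmax(theta)
```

Because the actor is updated with the critic's smoothed estimate instead of the raw noisy reward, its updates fluctuate less than those of a pure policy gradient agent would on the same problem.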


32. Because PG algorithms maximize reward (instead of, say, minimizing cost), they perform gradient ascent and 
not gradient descent. For more on this, see Footnote 12 in this chapter. 

33. Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning.
Machine Learning, 8, 229-56.

34. PG agents tend to converge on at least an optimal local solution, although some particular PG methods have 
been demonstrated to identify the optimal global solution to a problem. See Fazel, K., et al. (2018). Global con- 
vergence of policy gradient methods for the linear quadratic regulator. arXiv: 1801.05039. 
Figure 13.12 The actor-critic algorithm combines the policy gradient approach to
reinforcement learning (playing the role of actor) with the Q-learning approach (playing
the role of critic).

[Diagram: Actor-Critic = PG + Q-Learning]



In a broad sense, the actor-critic algorithm is reminiscent of the generative adver- 
sarial networks of Chapter 12. GANs have a generator network in a loop with a 
discriminator network, with the former creating fake images that are evaluated by 
the latter. The actor-critic algorithm has an actor in a loop with a critic, with the 
former taking actions that are evaluated by the latter. 


The advantage of the actor-critic algorithm is that it can solve a broader range of prob- 
lems than DQN, while it has lower variance in performance relative to REINFORCE. 
That said, because of the presence of the PG algorithm within it, the actor-critic is still
somewhat sample inefficient. 

While implementing REINFORCE and the actor-critic algorithm is beyond the
scope of this book, you can use SLM Lab to apply them yourself, as well as to examine 
their underlying code. 

Summary 

In this chapter, we covered the essential theory of reinforcement learning, including 
Markov decision processes. We leveraged that information to build a deep Q-learning 
agent that solved the Cart-Pole environment. To wrap up, we introduced deep RL 
algorithms beyond DQN such as REINFORCE and actor-critic. We also described SLM 
Lab—a deep RL framework with existing algorithm implementations as well as tools for 
optimizing agent hyperparameters. 

This chapter brings an end to Part III of this book, which provided hands-on 
applications of machine vision (Chapter 10), natural language processing (Chapter 11),
art-generating models (Chapter 12), and sequential decision-making agents. In Part IV, 
the final part of the book, we will provide you with loose guidance on adapting these 
applications to your own projects and inclinations. 
Key Concepts

Listed here are the key concepts from across this book. The final concept—covered 
in the current chapter—is highlighted in purple. 


■ parameters: 

■ weight w 

■ bias b 

■ activation a 

■ artificial neurons: 

■ sigmoid 

■ tanh 

■ ReLU 

■ linear 

■ input layer 

■ hidden layer 

■ output layer 

■ layer types: 

■ dense (fully connected) 

■ softmax 

■ convolutional 

■ de-convolutional 

■ max-pooling 

■ upsampling 

■ flatten 

■ embedding 

■ RNN 

■ (bidirectional-)LSTM 

■ concatenate 


■ cost (loss) functions: 

■ quadratic (mean squared error)

■ cross-entropy 

■ forward propagation 

■ backpropagation 

■ unstable (especially vanishing) gradients

■ Glorot weight initialization 

■ batch normalization 

■ dropout 

■ optimizers: 

■ stochastic gradient descent 

■ Adam 

■ optimizer hyperparameters: 

■ learning rate η

■ batch size 

■ word2vec

■ GAN components: 

■ discriminator network

■ generator network 

■ adversarial network 

■ deep Q-learning 
Moving Forward with Your
Own Deep Learning Projects


Congratulations, you’ve made it to the closing chapter of the book! In Part I, we
introduced deep learning: what it is and how it has become predominant. In Part II, we
delved into the essential theory of deep learning. And in Part III, we applied the theory 
you learned to a broad range of problems spanning vision, language, art, and changing 
environments. 

In this chapter, we provide you with resources and advice for moving onward from the 
examples provided in Part III to your very own deep learning projects, some of which 
could be of tremendous benefit to society. We cap everything off by providing context
on how your work may contribute to deep learning’s ongoing overhaul of software
globally and perhaps even to the dawn of artificial general intelligence.


Ideas for Deep Learning Projects 

In this section, we cover candidate ideas for your own first deep learning projects. 

Machine Vision and GANs 

The easiest way to get your feet wet with a deep learning problem of your own might 
be to load up the Fashion-MNIST dataset. 1 Keras comes preloaded with these data, 
which consist of 10 classes of photos of clothing (see Table 14.1). The Fashion-MNIST
data have identical dimensions to the handwritten MNIST digits you familiarized yourself
with in Part II: They are 8-bit 28x28-pixel grayscale bitmaps (an example is provided in 
Figure 14.1) spread across 60,000 training-image and 10,000 validation-image sets. Thus, 
by replacing the data-loading line (e.g., Example 5.2) of any existing MNIST-classifying 


1. Xiao, H., et al. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. 
arXiv: 1708.07747. 
Table 14.1 Fashion-MNIST categories

Class Label    Description
0              t-shirt
1              trousers
2              pullover
3              dress
4              coat
5              sandal
6              shirt
7              sneaker
8              bag
9              ankle boot


Figure 14.1 Following our pixel-by-pixel rendering of an MNIST digit (Figure 5.3), this is 
an example of an image from the Fashion-MNIST dataset. This particular image belongs 
to class 9, so—as per Table 14.1—it is an ankle boot. Check out our Fashion MNIST Pixel 
by Pixel Jupyter notebook for the code we used to create this figure. 


Jupyter notebook from this book with the following code, the Fashion-MNIST data can 
trivially be substituted in: 

from keras.datasets import fashion_mnist

(X_train, y_train), (X_valid, y_valid) = fashion_mnist.load_data()

From there, you can begin experimenting with modifying your model architecture and 
tuning your hyperparameters to improve validation-set accuracy. The Fashion-MNIST 
data are quite a bit more challenging to classify relative to the handwritten MNIST dig- 
its, so they present a rewarding problem for applying the material you learned in this 
book. By Chapter 10, we observed greater than 99 percent validation accuracy with
MNIST (see Figure 10.9), but obtaining validation accuracy greater than 92 percent with 
Fashion-MNIST is not easy, and achieving anything greater than 94 percent is downright 
impressive. 
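When inspecting a Fashion-MNIST model's predictions, it can help to translate integer class labels back into the descriptions of Table 14.1 and to compute validation accuracy by hand. The sketch below is self-contained and does not load the dataset; the helper names are hypothetical and the labels and predictions are made up purely for illustration.

```python
# Class labels and descriptions as listed in Table 14.1.
FASHION_MNIST_LABELS = {
    0: "t-shirt", 1: "trousers", 2: "pullover", 3: "dress", 4: "coat",
    5: "sandal", 6: "shirt", 7: "sneaker", 8: "bag", 9: "ankle boot",
}

def label_name(class_label):
    """Translate an integer class label into its description."""
    return FASHION_MNIST_LABELS[int(class_label)]

def validation_accuracy(y_true, y_pred):
    """Fraction of validation labels predicted correctly."""
    correct = sum(int(t) == int(p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up labels and predictions purely for illustration.
y_valid = [9, 0, 6, 2]
y_hat = [9, 6, 6, 2]
accuracy = validation_accuracy(y_valid, y_hat)
```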

Other excellent machine vision datasets for deep learning image-classification models 
can be found via the following sources: 

■ Kaggle: This data-science competition platform has many real-world datasets. Building
a winning model could earn you real-world money, too! For example, the
platform’s Cdiscount Image Classification Challenge had a $35,000 cash prize
for classifying images of products for a French e-commerce giant. 2 The datasets
available via Kaggle come and go as competitions begin and end, but at any given
time there are likely to be a number of large image datasets available, with model-building
experience, kudos, and maybe even cash prizes for you to benefit from.

■ Figure Eight: This data-labeling-via-crowdsourcing company (formerly known
as CrowdFlower) provides dozens of publicly available, superbly curated
image-classification datasets. To peruse what’s available, visit
figure-eight.com/data-for-everyone and search for the word image.

■ The researcher Luke de Oliveira compiled a clear, concise list of the best-known
datasets among deep learning practitioners. Have a look under the “Computer
Vision” heading at bit.ly/LukeData.

If you’re looking to build and tune your own GAN, small datasets you could start off
with include: 

■ One or more of the classes of images from the Quick, Draw! dataset we leveraged 
in Chapter 12 3 

■ The Fashion-MNIST data 

■ The plain old handwritten MNIST digits 

Natural Language Processing 

In the same way that the Fashion-MNIST data plug right in to the image-classification 
models we built in this book, datasets curated by Xiang Zhang and his colleagues 
from Yann LeCun’s (Figure 1.9) lab can be dropped straightforwardly into the
natural-language-classification models we built in Chapter 11, making them ideal data for a first
NLP project of your own.

All eight of Zhang et al.’s natural language datasets are described in detail in their
paper 4 and are available for download at bit.ly/NLPdata. Each dataset is at least an order
of magnitude larger than the 25,000-training-sample IMDb film-sentiment data we
worked with in Chapter 11, enabling you to experiment with the value of much more 
complex deep learning models and much richer word-vector spaces. Six of the datasets 


2. bit.ly/kaggleCD

3. github.com/googlecreativelab/quickdraw-dataset

4. See section four (“Large-scale Datasets and Results”) of Zhang, X., et al. (2016). Character-level convolutional 
networks for text classification. arXiv: 1509.01626. 
have more than two classes (which would require you to have multiple output neurons
in a softmax layer), and the other two are binary classification problems (enabling you to 
retain the single sigmoid output we used for the IMDb data): 

■ Yelp Review Polarity: 560,000 training samples and 38,000 validation samples that are
classed as either positive (four- or five-star) or negative (one- or two-star) reviews of
services and locations posted on the website Yelp

■ Amazon Review Polarity: A whopping 3.6 million training samples and 400,000
validation samples collected from the e-retail giant Amazon that are either positive or
negative product reviews
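The difference between the binary and multiclass datasets comes down to how the output layer and loss are configured. The following sketch expresses that choice as a small helper; `output_layer_config` is a hypothetical function, and the returned dictionary simply collects settings (unit count, activation name, and Keras loss name) that you would pass to your own model-building code.

```python
def output_layer_config(n_classes):
    """Suggest output-layer settings for a text classifier (hypothetical helper)."""
    if n_classes == 2:
        # Binary problem: a single sigmoid neuron, as with the IMDb data.
        return {"units": 1, "activation": "sigmoid",
                "loss": "binary_crossentropy"}
    # More than two classes: one softmax neuron per class.
    return {"units": n_classes, "activation": "softmax",
            "loss": "categorical_crossentropy"}

polarity = output_layer_config(2)    # e.g., Yelp or Amazon Review Polarity
multiclass = output_layer_config(5)  # a hypothetical five-class dataset
```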

As with machine vision, NLP data available from Kaggle, Figure Eight (again, search
for the word sentiment or text on figure-eight.com/data-for-everyone), and Luke de
Oliveira (under the “Natural Language” heading at bit.ly/LukeData) would form the
basis of solid self-directed deep learning projects.

Deep Reinforcement Learning 

A first deep reinforcement learning project could involve a: 

■ New environment: By changing the OpenAI Gym environment in our Cartpole DQN 
notebook, 5 you can use the DQN agent you studied in Chapter 13 to tackle an 
environment other than the Cart-Pole game. Some relatively simple options include 
Mountain Car (MountainCar-v0) and Frozen Lake (FrozenLake-v0).

■ New agent: If you have access to a Unix-based machine (which includes ones running
macOS), you can install SLM Lab (Figure 13.10) to try out other agents (e.g.,
an Actor-Critic agent; see Figure 13.12). Some of these could be sophisticated
enough to excel in advanced environments like the Atari games 6 provided by
OpenAI Gym or the three-dimensional environments provided by Unity.

Once you’re comfortable with fairly advanced agents, it could be rewarding to try
them out in other environments like DeepMind Lab (Figure 4.14) or to unleash a single 
agent upon multiple different environments simultaneously (SLM Lab can help facilitate 
this for you). 

Converting an Existing Machine Learning Project 

Although all of the projects we’ve suggested thus far involve using third-party data
sources, you may very well have collected data already yourself. You may even have al- 
ready used these data for machine learning—say, with a linear regression model or support 
vector machines. In these cases, you could feed the data you already have into a deep 
learning model. You could begin with a three-hidden-layer dense net like our Deep Net in 
Keras model from Chapter 9. If you’re keen to predict a continuous variable as opposed to
a categorical one, then perhaps our Regression in Keras notebook (covered near the end of 
Chapter 9) would serve as an appropriate template. 

You could feed more or less unadulterated data into your deep learning model, or, if
you’ve already extracted features from your raw data, there’s certainly no harm in passing


5. To do this, change the string argument you pass into gym.make() from Example 13.1.

6. gym.openai.com/envs/#atari 
Figure 14.2 The wide and deep model architecture concatenates together inputs from
two separate legs. The deep leg receives more or less unadulterated (“raw”) input data
and uses several data-appropriate (e.g., convolutional, recurrent, dense) layers of neurons
to extract features automatically. Its “deep”-ness may result from having many layers of
neurons. The wide leg, meanwhile, receives manually curated features (extracted from
the raw data prior to modeling via expertly defined functions) as inputs. Its “wide”-ness
may result from having many such features serving as inputs.


these features in as inputs. Indeed, researchers from Google 7 have popularized a wide and 
deep modeling approach that handles existing engineered features while simultaneously 
learning new features from raw input data. See Figure 14.2 for a generalized schematic of 
this wide and deep approach, which incorporates the concat layer introduced at the end 
of Chapter 11 (see Example 11.41).
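
The two-legged arrangement described above could be sketched with the Keras functional API roughly as follows. This is an illustration under stated assumptions rather than the architecture from the Google paper: the function name build_wide_and_deep, the layer sizes, and the input dimensions (784 raw values, 20 curated features, 10 classes) are all hypothetical, and the deep leg here uses dense layers only.

```python
def build_wide_and_deep(n_raw=784, n_curated=20, n_classes=10):
    """Assemble a small wide and deep classifier (sizes are illustrative)."""
    # Imported inside the function so that defining it does not require TensorFlow
    from tensorflow.keras.layers import Input, Dense, concatenate
    from tensorflow.keras.models import Model

    # Deep leg: raw inputs pass through several layers that learn features
    raw_input = Input(shape=(n_raw,), name='raw_input')
    deep = Dense(64, activation='relu')(raw_input)
    deep = Dense(64, activation='relu')(deep)
    deep = Dense(64, activation='relu')(deep)

    # Wide leg: manually curated features enter the model directly
    curated_input = Input(shape=(n_curated,), name='curated_input')

    # Concatenate the learned features with the curated ones, then classify
    merged = concatenate([deep, curated_input])
    output = Dense(n_classes, activation='softmax')(merged)

    return Model(inputs=[raw_input, curated_input], outputs=output)
```

A model built this way would be fit on a list of two arrays, one per leg, in the same order as the inputs argument.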


Resources for Further Projects 

For moving beyond the initial projects suggested above, we maintain a directory of helpful
resources at jonkrohn.com/resources. There, we provide links to:

■ Open data sources that are well organized and, in many cases, very large 

■ Recommended hardware and cloud infrastructure options for training larger deep 
learning models 

■ Compilations of key deep learning papers and implementations of the research 
covered within them 


7. bit.ly/wideNdeep 
■ Interactive deep learning demos

■ Examples of recurrent neural networks applied to time series predictions, such as
financial applications 8

Socially Beneficial Projects 

In particular, we'd like to draw your attention to the section of our resources page titled 
“Problems Worth Solving.” In this section, we list resources that summarize the most 
pressing global issues facing society in our time—issues that we encourage you to apply 
deep learning techniques toward solving. As an example, in one of these studies, 9 the 
authors—from the McKinsey Global Institute—examine 10 social-impact domains: 

1. Equality and inclusion 

2. Education 

3. Health and hunger 

4. Security and justice 

5. Information verification and validation 

6. Crisis response

7. Economic empowerment 

8. Public and social sector

9. The environment 

10. Infrastructure 

They go on to detail the prospective utility of many of the techniques introduced in this
book to each of these domains, including the following particular examples: 

■ Deep learning on structured data (the dense nets of Chapters 5 to 9): applicable to all
10 domains

■ Image classification, including handwriting recognition (Chapter 10): all domains except 
public and social sector 

■ NLP, including sentiment analysis (Chapter 11): all domains except infrastructure

■ Content generation (Chapter 12): applicable to the equality and inclusion domain as 
well as to the public and social sector domain 

■ Reinforcement learning (Chapter 13): applicable to the health and hunger domain

The Modeling Process, Including 
Hyperparameter Tuning 

With any of the deep learning project ideas we’ve covered in this chapter, hyperparameter
tuning is likely to prove key to your success. In this section, we provide you with a
step-by-step modeling process that you can use as a rough template for your own projects. 
Bear in mind, however, that you may need to stray from our recommended procedural 


8. This topic is of great interest to many students of deep learning, but it’s beyond the scope of this book. 

9. Chui, M. (2018). Notes from the AI frontier: Applying AI for social good. McKinsey Global Institute, 
bit.ly/aiForGood
path in a number of ways because of the unique specifics of your particular project. For
example, you’re unlikely to proceed strictly linearly through the following steps: As you
reach later steps in the process, you may develop a hunch 10 about how an earlier step
might be improved based on model behavior, 11 so you’ll end up circling back and
proceeding through some of the steps several times, perhaps even dozens of times or more!

Here’s our rough step-by-step guide:

1. Parameter initialization: As covered in Chapter 9 (see Figure 9.3), you should initialize
your model’s parameters with sensible random values. We recommend initializing
biases with zeros and initializing weights using Xavier Glorot’s approach.
Thankfully, with Keras, sensible layer initializations such as these will generally be
handled for you automatically.

2. Cost function selection: If you’re solving a classification problem, you should probably
be using cross-entropy cost. If you’re solving a regression problem, then you should
probably be using mean-squared-error cost. If you’re interested in experimenting,
however, keras.io/losses offers further options.

3. Get above chance: If your initial model architecture (which could be based directly 
on any of the models we went over in this book) attains below-chance performance 
on your validation data (e.g., <10 percent accuracy with the 10-class MNIST-digit
data), then consider these tactics:

■ Simplifying your problem: For example, if you’re working with the MNIST digits,
you could reduce the number of classes you’re classifying from 10 down to 2.

■ Simplifying your network architecture: Perhaps you’re doing something rather silly
and not realizing it. Or perhaps your model is too deep and your gradient of 
learning is vanishing severely. By simplifying your model architecture—such 
as by removing layers—you may bring these potential issues to light. 

■ Reducing your training set size: If you have a large training dataset, waiting for a 
single epoch to finish training could take a long time. By drastically subsetting 
your training sample, you can iterate and improve your model more rapidly. 

4. Layers: Once your model is learning to any extent, you can begin experimenting 
with your layers. You could try: 

■ Varying the number of layers: Following the guidelines discussed in Chapter 8 
(the section containing Figure 8.8), you could try adding or removing indi- 
vidual layers or blocks of layers (like the conv-pool blocks in Figure 10.10). 

■ Varying the types of layers: Depending on your particular problem and dataset, 
particular layer types might markedly outperform others. Consider, for 
example, the impact of the layer changes we made across our film-sentiment 
classifiers in Chapter 11 (see Table 11.6 for a summary). 


10. As you carry out more and more of your own deep learning projects, and as you examine more and more
of other people’s high-performing model architectures (e.g., in GitHub, StackOverflow, and arXiv papers), you’ll
develop an intuition for how to adapt your model’s design and hyperparameters to a given problem.

11. Model behavior can be studied by, for example, monitoring training- and validation-set loss as your model 
trains. This would be made easier by using TensorBoard (see Figure 9.8). 
■ Varying layer width: We recommend varying the number of neurons per layer
by powers of 2, also as per Chapter 8 near Figure 8.8.

5. Avoid overfitting: As discussed in Chapter 9, we recommend encouraging your model
to generalize beyond your training dataset by employing dropout, data augmentation
(if possible, e.g., with image data), and/or batch normalization. If you happen
to be able to acquire additional, new training data for your model, that would likely
be helpful, too. Finally, as we demonstrated countless times in Chapter 11, if your
model does overfit during training, it’d be wise to reload the model weights from
a previous epoch (probably the one in which the validation loss was lowest; see
Figure 14.3 for an example).

6. Learning rate: As per Chapter 9, you can tune your learning rate up or down. However,
“fancy” optimizers like Adam and RMSProp often manage to handle adjusting
learning rate automatically on the fly. 12

7. Batch size: This hyperparameter is likely to be one of the least impactful, so you can 
leave it to last. Refer back to Chapter 8 (near Figure 8.7) for guidance on tuning it 
up or down. 
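
To make the first two steps of the guide above concrete, here is a small plain-Python sketch of Glorot-style weight initialization with zero biases, alongside the two cost functions mentioned. It assumes the uniform variant of Glorot initialization, with limit sqrt(6 / (n_in + n_out)); the function names, layer sizes, and example values are illustrative only, and in practice Keras carries out these computations for you.

```python
import math
import random


def glorot_uniform(n_in, n_out, rng=random):
    """Sample an n_in-by-n_out weight matrix uniformly from [-limit, limit]."""
    limit = math.sqrt(6.0 / (n_in + n_out))  # Glorot (Xavier) uniform limit
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy; y_true rows are one-hot, y_pred rows are probabilities."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        total -= sum(t * math.log(max(p, eps))
                     for t, p in zip(true_row, pred_row))
    return total / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


weights = glorot_uniform(784, 64)   # 784 inputs feeding 64 neurons
biases = [0.0] * 64                 # biases start at zero

# Classification: a coin-flip prediction on a two-class problem
chance_loss = cross_entropy([[1, 0]], [[0.5, 0.5]])

# Regression: one perfect prediction and one that is off by 2
regression_loss = mean_squared_error([1.0, 2.0], [1.0, 4.0])
```

The cross-entropy of a coin-flip prediction is a handy reference point for step 3: a two-class model whose validation loss sits at or above that value has not yet risen above chance.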


[Plot: loss by epoch (epochs 1 through 4) for the training and validation sets]

Figure 14.3 A plot of training loss (red) and validation loss (blue) over epochs of model
training. These particular results come from our Multi ConvNet Sentiment Classifier
notebook (see the final section of Chapter 11), but this overfitting pattern is typical of
deep learning models. After epoch 2, the training loss continues toward zero while the
validation loss creeps upward. Epoch 2 has the lowest validation loss, so the parameters
from that epoch should be reloaded for further model testing and (perhaps!) even for
use in a production system.
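
Choosing which epoch to reload can be automated. The sketch below picks the epoch with the lowest validation loss from a list of per-epoch values and builds the name of the weights file to restore. The loss values are hypothetical (they are not read from Figure 14.3), and the weights.NN.hdf5 naming pattern is an assumption about how per-epoch checkpoints were saved.

```python
def best_epoch(val_losses):
    """Return the 1-based epoch whose validation loss is lowest."""
    return val_losses.index(min(val_losses)) + 1


# Hypothetical validation losses for epochs 1 to 4
val_losses = [0.31, 0.26, 0.29, 0.36]

epoch = best_epoch(val_losses)
checkpoint_file = "weights.{:02d}.hdf5".format(epoch)  # assumed file-naming pattern
print(epoch, checkpoint_file)
```

With Keras, per-epoch weights of this kind can be written by a ModelCheckpoint callback during training and restored afterward with the model's load_weights() method.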


12. Note that there are exceptions to this. For example, we did find tuning learning rate to be impactful even with 
optimizers like Adam and RMSProp in Chapters 12 (with respect to GANs) and 13 (with respect to reinforcement 
learning agents). 
Figure 14.4 A strictly structured grid search (shown in the left-hand panel) is less likely
to identify optimal hyperparameters for a given model than a search over values that are 
sampled randomly over the same hyperparameter ranges (right-hand panel). 

Automation of Hyperparameter Search 

With all of the hyperparameters that we could endlessly play around with for a given deep
learning model, it should come as little surprise that developers (who are famously lazy!)
have come up with approaches for automating hyperparameter search. In Chapter 13,
we covered the use of SLM Lab for searching for hyperparameters in deep reinforcement
learning models specifically; for deep learning models in general, we recommend
Spearmint. 13 Note that regardless of the hyperparameter-search approach that you decide
to go with, James Bergstra and Yoshua Bengio 14 from the University of Montreal have
provided evidence that selecting random values for your hyperparameters is more likely to
identify optimal hyperparameters for your model than a rigidly structured grid search; see
Figure 14.4. 15
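
One way to see the intuition behind the random-versus-grid comparison is to count how many distinct values of each hyperparameter are tried for a fixed budget of trials. The sketch below is illustrative only: the two hyperparameters (learning rate and dropout), their ranges, and the function names are assumptions, and no model is actually trained.

```python
import itertools
import math
import random


def grid_points(learning_rates, dropouts):
    """Every combination of the listed values (a structured grid)."""
    return list(itertools.product(learning_rates, dropouts))


def random_points(n_trials, lr_range, dropout_range, seed=42):
    """Independently sampled (learning rate, dropout) pairs over the same ranges."""
    rng = random.Random(seed)
    low, high = math.log10(lr_range[0]), math.log10(lr_range[1])
    points = []
    for _ in range(n_trials):
        learning_rate = 10 ** rng.uniform(low, high)  # sample on a log scale
        dropout = rng.uniform(dropout_range[0], dropout_range[1])
        points.append((learning_rate, dropout))
    return points


grid = grid_points([0.0001, 0.001, 0.01], [0.2, 0.35, 0.5])
sampled = random_points(len(grid), (0.0001, 0.01), (0.2, 0.5))

# Same budget of trials: compare how many distinct learning rates each search tries
print(len({lr for lr, _ in grid}), len({lr for lr, _ in sampled}))
```

If only one of the two hyperparameters turns out to matter much for a given model, the grid has spent its nine trials on three candidate values of it, whereas the random search has explored nine.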

Deep Learning Libraries 

Throughout this book, we have used Keras to construct and run our deep learning models.
There are, however, countless other deep learning libraries, and more pop up every
year. In this section, we review the other leading options you have.

Keras and TensorFlow 

TensorFlow is perhaps the best-known deep learning library, its name derived from the
concept of tensors (arrays of information, e.g., model inputs x or activations a) flowing
through operations (e.g., those that define the mathematics of artificial neurons like our
“most important equation” from back in Figure 6.7). The TensorFlow library was originally
developed for internal use at Google, and the tech giant open-sourced the project in


13. Snoek, J., et al. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural 
Information Processing Systems, 25. Code available at github.com/JasperSnoek/spearmint. 

14. Figure 1.10 provides a portrait of Bengio. 

15. Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning 
Research, 13, 281-305. 
Figure 14.5 The relative Google search frequency (from October 2015 to February
2019) of five of the most popular deep learning libraries 


2015. Figure 14.5 illustrates the relative interest in five of the most popular deep learning
libraries, as per frequency of Google searches. Keras is the clear runner-up, with TensorFlow
in the lead. Given this, you might be particularly interested in learning how to use
TensorFlow. Well, we have good news for you: You already know how to.

Not only is Keras the high-level API that we’ve been using throughout this book
to call TensorFlow in the background, but also, as of the release of TensorFlow 2.0 in
2019, Keras layers are the recommended approach for building models directly with the
TensorFlow library itself. To build TensorFlow models in earlier versions of the software,
it was necessary to become familiar with a fairly abstruse three-step process:

1. Configuring a detailed “computational graph” 

2. Initializing this computational graph within a “session” 

3. Feeding data into the session while fetching information you’d like to access (e.g., 
summary metrics, model parameters) out of the session 

This relatively esoteric process was in place because it enabled TensorFlow to optimize the 
execution of deep learning model training and production-time execution across as many 
devices (CPUs and GPUs, perhaps spread across multiple servers) as are made available to 
it. In time, the developers behind libraries like PyTorch devised creative mechanisms to
facilitate the best of both worlds: 

1. The conceptually simple, layer-focused, instantly executable building of deep learning
models and simultaneously ...

2. The highly optimized model execution across however many devices are available 

The team behind TensorFlow responded by more tightly incorporating Keras layers and
by creating Eager mode, an approach to enable immediate execution (in place of the
previous three-step process) without sacrificing performance. Prior to TensorFlow 2.0, Eager
mode needed to be activated in order to use it, 16 but from 2.0 onward it’s the built-in
default.

Converting any of the code we covered in this book from being run with the Keras 
library to being run within TensorFlow itself is painless. For example, have a look at 
our Deep Net in TensorFlow notebook, which is identical to our Deep Net in Keras note- 
book (from Chapter 9) except for how the dependencies are loaded (Example 14.1, as 
compared to Examples 5.1 and 9.4). 

Example 14.1 Dependencies for building a Keras layer-based deep net in 
TensorFlow without loading the Keras library 

import tensorflow as tf
from tensorflow.python.keras.datasets import mnist
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, Dropout
from tensorflow.python.keras.layers import BatchNormalization
from tensorflow.python.keras.optimizers import SGD
from tensorflow.python.keras.utils import to_categorical

From there, you can begin exploring the added functionality and flexibility that Tensor- 
Flow offers. 

Particular reasons you might use TensorFlow with Keras layers instead of the high-level 
Keras API alone include: 

■ Customizing your model to your heart’s content, including by subclassing
tf.keras.Model to forward propagate through your model in any way that you
fancy 17

■ Creating high-performance data-input pipelines by using tf.data

■ Deploying your model to 

■ High-performance systems on servers with TensorFlow Serving 

■ Mobile or embedded devices with TensorFlow Lite 

■ A web browser with TensorFlow.js 


PyTorch 

PyTorch is the cousin of a machine learning framework called Torch, which is based in 
the programming language Lua. It’s really an extension of Torch, designed to feel fast and 
intuitive in the much more widely used Python language. PyTorch is developed primarily 
by the Facebook AI Research group led by Yann LeCun (Figure 1.9). Although not quite 
as popular as TensorFlow or Keras, PyTorch has gained a lot of traction in a short period 
of time (see Figure 14.5) and with good reason, as we elaborate on here. 


16. With a single line of code: tf.enable_eager_execution().

17. See tensorflow.org/guide/keras#model_subclassing. 
Many high-level deep learning libraries (including Keras) serve as simple wrappers for
low-level code (sometimes in Python, and sometimes in other languages such as C); how- 
ever, PyTorch is not a simple Python wrapper for Torch. Rather, PyTorch was completely 
rewritten and specifically tailored to feel native to people familiar with Python, while 
retaining the computational efficiency of the original Torch library. 

At its core, PyTorch performs matrix operations, much like NumPy. Indeed,
PyTorch’s tensors are compatible with most NumPy operations, and methods exist for
converting between NumPy arrays and PyTorch tensors. Because of this deep integration 
with NumPy, custom layers can be written directly in Python if this extra flexibility is 
desired. Unlike NumPy, however, PyTorch has specific systems in place to execute its 
calculations on GPUs, thereby leveraging these processors’ massively parallel matrix- 
calculation capabilities. Additionally, acceleration libraries are built in, which helps to 
make PyTorch fast regardless of device, and custom memory allocators enable it to be 
memory efficient. 

If you’d like to learn more, in Appendix C we delve into many of the features of the
PyTorch library. We compare and contrast it with TensorFlow, and we provide a hands-on
demonstration of training a deep learning model. As you’ll see, the syntax of PyTorch
is similar to that of Keras, and you should be able to pick it up fairly quickly if you so
desire.

MXNet, CNTK, Caffe, and So On 

Beyond Keras, TensorFlow, and PyTorch, there are myriad other deep learning libraries 
out there. Examples include: 

■ MXNet, which was developed by Amazon. 

■ CNTK, the Microsoft Cognitive Toolkit. 

■ Caffe, out of the University of California, Berkeley, which is designed exclusively for machine
vision/CNN applications. Caffe2, its lightweight successor, was being developed by
Facebook AI Research, but it was folded into FAIR’s PyTorch project in 2018.

■ Theano, a University of Montreal project that once rivaled TensorFlow as the leading
deep learning library, but is no longer in development largely because many of
its developers jumped ship to Google’s TensorFlow project.

All of these libraries—indeed, any popular ones—are open-source. In addition, because 
the vast majority of these libraries follow the layer-focused design of Keras and have a 
similar syntax, you should have little trouble recognizing their code and employing them 
yourself if you have the inclination to.

Software 2.0 


The models that all of the available deep learning libraries facilitate are revolutionizing 
the world of software. In a widely shared blog post written by the prominent data scientist
Andrej Karpathy (Figure 14.6), 18 he argues that deep learning is facilitating “Software 


18. bit.ly/AKsoftware2
Figure 14.6 Andrej Karpathy is the director of AI at Tesla, the California-based
automotive and energy firm. We mentioned him once earlier, in a footnote in
Chapter 10. Karpathy’s background spans many institutions mentioned in this book,
including OpenAI (Figure 4.13 and Chapter 13), Stanford University (where he completed
his PhD under Fei-Fei Li; see Figure 1.14), DeepMind (e.g., Figures 4.4 through 4.10),
Google (countless mentions across this book, including as the developers behind
TensorFlow), and the University of Toronto (e.g., Figures 1.16 and 3.2).


2.0.” Software 1.0 is what Karpathy describes as classic computer programming languages
like Python, Java, JavaScript, C++, and so on. With Software 1.0, we need to provide 
explicit instructions within a computer program in order to have the computer produce 
outputs in the desired manner. 

Software 2.0, in contrast, consists of deep learning models that approximate functions, 
like the functions we approximated in this book in order to classify handwritten digits, 
predict house prices, analyze the sentiment of film reviews, generate sketches of apples, 
and converge on Q* in order to play the Cart-Pole game. The millions or billions of 
parameters in productionized deep learning models today are increasingly demonstrating 
themselves to be more adaptable, useful, and powerful than hard-coded Software 1.0. 
Software 2.0 doesn’t replace Software 1.0, however: It builds on top of it, with Software
1.0 providing all of the critical digital infrastructure that Software 2.0 exists within.

Some of the particular advantages of Software 2.0 covered by Karpathy are: 

1. Computational homogeneity: Deep learning models are made up of homogenous 
units—such as ReLU neurons—enabling matrix computations with these units to 
be highly optimizable and scalable. 

2. Constant running time: Once making inferences in production systems, a given deep
learning model will use the same amount of compute regardless of the input fed 
into it. Software 1.0 approaches, which could involve countless if-else statements, 
could require widely varying amounts of compute depending on the particular 
input fed into it. 
3. Constant memory use: For the same reasons as the preceding point, a given deep
learning model in production requires the same amount of memory resources 
regardless of the particular input fed into it. 

4. Easy: By reading this one book, you’ve developed the skills to create high-performing
algorithms across a range of domains. Prior to the advent of deep
learning, markedly more domain-specific expertise would have been required to do 
this in each individual domain. 

5. Superior: As we review below, deep learning models can dramatically
outperform other approaches.
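
The constant-running-time point in the list above can be illustrated by counting operations. In this sketch, a single dense ReLU layer performs the same number of multiply-adds whatever values it is given, whereas a simple rule-based routine stops early on some inputs and not on others. The weights, inputs, and function names are made up for illustration; they do not come from Karpathy's post.

```python
def dense_relu_layer(x, weights, biases):
    """Forward propagate through one dense ReLU layer, counting multiply-adds."""
    multiply_adds = 0
    activations = []
    for neuron_weights, bias in zip(weights, biases):
        z = bias
        for w, x_i in zip(neuron_weights, x):
            z += w * x_i
            multiply_adds += 1
        activations.append(max(0.0, z))  # ReLU activation
    return activations, multiply_adds


def first_negative(x):
    """A Software 1.0-style rule: scan until a negative value, counting steps."""
    steps = 0
    for x_i in x:
        steps += 1
        if x_i < 0:
            break
    return steps


weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]   # made-up weights: 2 neurons, 3 inputs
biases = [0.0, 0.1]

_, ops_a = dense_relu_layer([1.0, 2.0, 3.0], weights, biases)
_, ops_b = dense_relu_layer([-9.0, 0.0, 42.0], weights, biases)
steps_a = first_negative([1.0, 2.0, 3.0])
steps_b = first_negative([-9.0, 0.0, 42.0])
```

Here ops_a and ops_b are equal because the amount of work is fixed by the layer's shape, while steps_a and steps_b differ because the rule's work depends on the data.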

In light of these points, let’s review the applications that were featured in Part III of 
this book: 

■ Machine vision (e.g., the MNIST digit recognition from Chapter 10): With tra- 
ditional machine learning, this required hard-coding visual features extensively, 
typically requiring years of expertise in the field. Deep learning models perform 
better (recall Figure 1.15), learn features automatically, and require little vision- 
specific expertise to deploy effectively. 

■ Natural language processing (e.g., the sentiment analysis from Chapter 11): In the 
traditional machine learning approach, many years of linguistics experience would 
typically be required to build an effective algorithm, including an understanding of 
the unique syntax and semantics of any given language involved in the application. 
Here too, deep learning models tend to perform better (as suggested by Figure 2.3). 
They learn the relevant features automatically, and again they require minimal 
linguistics-specific expertise to use effectively. 

■ Simulating art and visual imagery (e.g., the drawings in Chapter 12): Generative 
adversarial networks, which incorporate deep learning models, produce far more 
compelling and realistic images than any preexisting approaches. 19 

■ Game-playing (e.g., the Deep Q-Learning networks in Chapter 13): A single 
algorithm, AlphaZero, can crush any Software 1.0 or traditional machine learning 
approach to playing Go, chess, and shogi (as shown in Figure 4.10). Remarkably, it 
does so more efficiently and doesn’t require any training data. 

Approaching Artificial General Intelligence 

Recalling the development of vision in trilobites from Chapter 1 (Figure 1.1), many 
millions of years passed before biological life evolved the sophisticated, full-color visual 
systems that primates like us benefit from. In contrast, it was a matter of decades from the 
first computer-vision systems (Figure 1.8) to ones that could match or exceed the performance
of humans at visual-recognition tasks (Figure 1.15). 20 Whereas image classification
is a classic example of artificial narrow intelligence (ANI), rapid advancements such as


19. Visit distill.pub/2017/aia to experience an interactive article by Shan Carter and Michael Nielsen that
expounds on how GANs can be used to augment human intelligence.

20. The human-accuracy benchmark in Figure 1.15 is Andrej Karpathy (Figure 14.6) himself, by the way. 





this lead many researchers to believe that artificial general intelligence (AGI) and maybe
even artificial super intelligence (ASI) can be attained in our lifetimes. 21 The Müller
and Bostrom survey results we mentioned back in Chapter 1, for example, have median
estimates of 2040 and 2060 for the genesis of AGI and ASI, respectively.

Four primary factors are driving our rapid advances in ANI and also may be hurrying 
us in the direction of AGI or ASI: 

1. Data: In recent years, the amount of data in the digital realm doubles about 
every 18 months. This exponential rate of growth shows no sign of slowing (re- 
call from Chapter 13, for example, the relentless swell of data produced by an 
individual autonomous vehicle). A lot of the data is low quality, but—as with the 
open data sources we mentioned earlier in this chapter—datasets are becoming 
larger, cheaper to store, and often better organized (ImageNet from Chapters 1 and 
10 is an exemplar). 

2. Computing power: Although the rate of performance improvements on individual
CPUs may slow down in coming years, 22 the massive parallelization of matrix 
operations within GPUs and across many servers—each with multiple CPUs and 
perhaps multiple GPUs—will continue to increase the ready availability of compute. 

3. Algorithms: A rapidly enlarging army of data-focused scientists and engineers—who 
are global, and who are spread across both the academic and commercial realms—is 
tweaking the techniques used to mine datasets for meaningful patterns. Every once 
in a while, there is a breakthrough like AlexNet (Figure 1.15). In recent years, deep 
learning has been associated with the bulk of these breakthroughs, many of which 
we’ve covered over the course of this book. 

4. Infrastructure: The Software 1.0 infrastructure such as open-source operating systems
and programming languages, paired with Software 2.0 libraries and techniques
(shared in nearly real time, worldwide, via arXiv and GitHub) and the low cost of
cloud-computing providers (e.g., Amazon Web Services, Microsoft Azure, Google
Cloud Platform) provide a highly scalable hotbed for approaches to be experimented
with on ever-larger datasets.
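
As a quick worked example of what the first factor above implies, and taking the 18-month doubling period purely at face value, the growth over a span of years is 2 raised to the number of doublings that fit in that span. The function name growth_factor is ours, for illustration.

```python
def growth_factor(years, doubling_months=18):
    """How many times larger a quantity becomes if it doubles every doubling_months."""
    return 2 ** (years * 12 / doubling_months)


print(growth_factor(3))    # 3 years contain two doublings
print(growth_factor(15))   # 15 years contain ten doublings
```

Under that assumption, the volume of data quadruples in three years and grows roughly a thousandfold in fifteen.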

The cognitive tasks that humans tend to find hard (e.g., playing chess, solving matrix
algebra problems, optimizing a financial portfolio) are generally the ones that Homo
sapiens have been doing for only thousands of years or fewer; these are the types of tasks
that today tend to be easy for machines. In contrast, the cognitive tasks humans find easy
(e.g., reading social cues, carrying an infant safely up the stairs) evolved over millions
of years and today remain beyond the reach of machines. So despite all of the justifiable
excitement around machine learning, AGI could be a long way off
and remains only a theoretical possibility at this time. Some examples of the significant
barriers that prevent contemporary deep learning from bringing about AGI include: 23


21. Refer back to the end of Chapter 1 for a refresher on ANI, AGI, and ASI. 

22. Moore’s “law” is anything but a law, and the shrinking of transistors down toward electron scale makes de- 
creasing the cost of computation on a given chip trickier and trickier. 

23. For more on the limitations of deep learning, read Marcus, G. (2018). Deep learning: A critical appraisal. arXiv: 
1801.00631. 
■ Deep learning requires training on many, many samples. These large datasets are not
always available, and, in stark contrast, biological learning systems—including those 
in mice and human infants—can often learn from a single example. 

■ Deep learning models are typically a black box. Although investigative techniques 
like Jason Yosinski and colleagues’ DeepViz tool 24 exist, these are the exceptions to
the rule. 

■ Deep learning models don’t leverage knowledge of the world; they don’t, for example,
take into account databases of facts when they make inferences.

■ To deep learning models, a predicted correlation between some input x and some 
outcome y provides no assessment of causation. Being able to move beyond pre- 
dicting correlations between variables toward causal relationships between them is 
presumably critical to the development of general intelligence. 

■ Deep learning models are often susceptible to unintuitive and embarrassing fail- 
ures, 25 and they can be deliberately duped by changes to even a single pixel in an 
input image. 26 

Perhaps some of these barriers catch your own interest, and you could consider dedicating
some of your career to contributing to devising solutions! We can’t know precisely what
the future will hold, but given the explosions of data, compute, algorithms, and infrastructure,
one prediction we’re confident making is that you should have little difficulty
identifying exhilarating opportunities to apply deep learning.

Summary 

This chapter wrapped up the book by providing you with project ideas, resources for further
learning, a general guide to fitting models, an overview of the deep learning libraries
available to you beyond Keras, and an exploration of the ways artificial neural networks
are rapidly reshaping software, with much more potential excitement in the years to
come!

Have fun going forward, and please do stay in touch: 

■ Here’s a Twitter account we use for posting about new content as it’s released, including
new studio-recorded video tutorials we anticipate publishing to accompany
the material covered in this book: twitter.com/JonKrohnLearns

■ We use Medium for long-form blog posts: medium.com/@jonkrohn

■ We set up a Google Group to enable readers of this book to ask questions and have 
them answered by other readers (perhaps even us!) in a forum format. You can find 
it here: bit.ly/DLIforum


24. bit.ly/DeepViz

25. Go to bit.ly/googleGaffe for an infamous example.

26. Deliberately misleading a machine learning algorithm is called an adversarial attack, and it is carried out by 
inputting an adversarial example. There are many papers on these; one outlining single-pixel adversarial attacks on 
CNNs is Su, J., et al. (2017). One pixel attack for fooling deep neural networks. arXiv: 1710.08864. 
Figure 14.7 Trilobyte waving good-bye


■ And, finally, feel free to add us on LinkedIn (e.g., linkedin.com/in/jonkrohn),
but be sure to mention you’re a reader because we don’t accept requests from just anyone!

We hope you enjoyed this visual, interactive introduction to deep learning. We’re 
deeply grateful for the time and energy you invested in this journey that you’ve taken
alongside us. Farewell, dear friend—from the amiable trilobite in Figure 14.7. 
A

Formal Neural Network Notation


To keep discussion of artificial neurons as straightforward as possible, in this book we
used a shorthand notation to identify them within a network. In this appendix, we lay out
a more widely used formal notation, which may be of interest if you’d like to:

■ Possess a more precise manner for describing neurons 

■ Follow closely the backpropagation technique covered in Appendix B 

Taking a look back at Figure 7.1, the neural network has a total of four layers. The
first is the input layer, which can be thought of as a collection of starting blocks for each
data point to enter the network. In the case of the MNIST models, for example, there are
784 such starting blocks, representing each of the pixels in a 28×28-pixel handwritten
MNIST digit. No computation happens within an input layer; it simply holds space for
the input values to exist in so that the network knows how many values it needs to be
ready to compute on in the next layer. 1

The next two layers in the network in Figure 7.1 are hidden layers, in which the bulk
of the computation within a neural network occurs. As we’ll soon discuss, the input values
$x$ are mathematically transformed and combined by each neuron in the hidden layer,
outputting some activation value $a$. Because we need a way to address specific neurons in
specific layers, we’ll use a superscript to define a layer, starting at the first hidden layer, and
a subscript to define a neuron in that layer. In Figure 7.1, then, we’d have $a_1^1$, $a_2^1$, and $a_3^1$
in the first hidden layer. In this way, we can precisely refer to an individual neuron in a
specific layer. For example, $a_2^2$ represents the second neuron in the second hidden layer.

Because Figure 7.1 is a dense network, the neuron $a_1^1$ receives inputs from all of the
neurons in the preceding layer, namely the network inputs $x_1$ and $x_2$. Each neuron has
its own bias, $b$, and we’ll label that bias in exactly the same manner as the activation $a$:
For example, $b_2^1$ is the bias for the second neuron in the first hidden layer.


1. For this reason, we usually don’t need a means to address a particular input neuron; they have no weights or 
biases. 
The green arrows in Figure 7.1 represent the mathematical transformation that takes
place during forward propagation, and each green arrow has its own individual weight
associated with it. In order to refer to these weights directly, we employ the following
notation: $w^1_{(1,2)}$ is the weight in the first hidden layer (superscript) that connects
neuron $a_1^1$ to its input $x_2$ in the input layer (subscript). This double-barreled subscript is
necessary because the network is fully connected: Every neuron in a layer is connected
to every neuron in the layer before it, and each connection carries its own weight. Let’s
generalize this weight notation:

■ The superscript is the hidden-layer number of the input-receiving neuron. 

■ The first subscript is the number of the neuron receiving the input within its hidden
layer.

■ The second subscript is the number of the neuron providing input from the preceding
layer.

As a further example, the weights for neuron $a_2^2$ will be denoted $w^2_{(2,i)}$, where $i$ is a
neuron in the preceding layer.

At the far right of the network, we finally have the output layer. As with the hidden
layers, output-layer neurons have weights and a bias, and these are labeled in the same
way.
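To make the indexing concrete, here is a minimal sketch in plain Python of how the formal notation could map onto nested lists. The layer sizes beyond the two inputs and three first-hidden-layer neurons, the weight and bias values, the sigmoid activation, and the helper names `w`, `b`, and `activation` are all assumptions made for illustration; they are not taken from Figure 7.1 or from the book's notebooks.

```python
import math

# Two inputs and three first-hidden-layer neurons follow the text; the sizes of
# the second hidden layer (2) and the output layer (1) are assumed here.
layer_sizes = [2, 3, 2, 1]

# weights[l][j][i] stores w^l_(j,i) and biases[l][j] stores b^l_j, with the
# 1-based neuron numbers shifted to 0-based list positions. Values are arbitrary.
weights = {}
biases = {}
for l in range(1, len(layer_sizes)):
    n_out, n_in = layer_sizes[l], layer_sizes[l - 1]
    weights[l] = [[0.1 * (j + 1) + 0.01 * (i + 1) for i in range(n_in)]
                  for j in range(n_out)]
    biases[l] = [0.01 * (j + 1) for j in range(n_out)]


def w(l, j, i):
    """Return w^l_(j,i): layer l, receiving neuron j, providing neuron i."""
    return weights[l][j - 1][i - 1]


def b(l, j):
    """Return b^l_j: the bias of neuron j in layer l."""
    return biases[l][j - 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def activation(l, j, prev):
    """Compute a^l_j from the values of the preceding layer (sigmoid assumed)."""
    z = sum(w(l, j, i) * prev[i - 1] for i in range(1, len(prev) + 1)) + b(l, j)
    return sigmoid(z)


x = [0.5, -1.0]               # network inputs x_1 and x_2
a_1_1 = activation(1, 1, x)   # first neuron of the first hidden layer
print(w(1, 1, 2), b(1, 2), a_1_1)
```

Storing each layer as one row per receiving neuron and one column per providing neuron mirrors the order of the double-barreled subscript, so `w(1, 1, 2)` reads the same way as the written symbol.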


B


Backpropagation


In this appendix, we use the formal neural network notation from Appendix A to dive
into the partial-derivative calculus behind the backpropagation method introduced in 
Chapter 8. 


Let’s begin by defining some additional notation to help us along. Backpropagation
works backwards, so the notation is based on the final layer (denoted $L$), and the earlier
layers are annotated with respect to it ($L-1$, $L-2$, ..., $L-n$). The weights, biases, and
outputs from functions are labeled appropriately with this same notation. Recall from
Equations 7.1 and 7.2 that the layer activation $a^L$ is calculated by multiplying the
preceding layer’s activation ($a^{L-1}$) by the weight $w^L$, adding the bias $b^L$ to produce $z^L$, and
passing this through an activation function (denoted simply as $\sigma$ here). Also, we
implement a simple cost function at the end; here we’re using the squared Euclidean distance
(that is, quadratic cost). Thus, for the final layer we have:



$$z^L = w^L \cdot a^{L-1} + b^L \tag{B.1}$$

$$a^L = \sigma(z^L) \tag{B.2}$$

$$C_0 = (a^L - y)^2 \tag{B.3}$$


In every iteration, we need the gradient of the total error from the preceding layer
($\partial C / \partial a^L$); in this way, the total error of the system is propagated backwards. We’ll call
this value $\delta_L$. Because backpropagation runs back-to-front, we start with the output
layer. This layer is a special case given that the error originates here in the form of the cost
function and there are no layers above it. Thus, $\delta_L$ is given as follows:


$$\delta_L = \frac{\partial C}{\partial a^L} = 2(a^L - y) \tag{B.4}$$


Again, this is a special case for the initial $\delta$ value; the remaining layers will be different
(more on that shortly). Now, to update the weights in layer $L$ we need to find the gradient
of the cost w.r.t. (with respect to) the weights, $\partial C / \partial w^L$. According to the chain rule,
this is the product of the gradient of the cost w.r.t. this layer’s output $a^L$, the
gradient of the activation function w.r.t. $z$, and the gradient of $z$ w.r.t. the
weights $w^L$:


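$$\frac{\partial C}{\partial w^L} = \frac{\partial C}{\partial a^L} \cdot \frac{\partial a^L}{\partial z^L} \cdot \frac{\partial z^L}{\partial w^L} \tag{B.5}$$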


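Since $\partial C / \partial a^L = \delta_L$ (Equation B.4), and taking $\sigma$ to be the sigmoid function
(whose derivative is $a^L(1 - a^L)$), this equation can be simplified to:

$$\frac{\partial C}{\partial w^L} = \delta_L \cdot a^L (1 - a^L) \cdot a^{L-1} \tag{B.6}$$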

This value is essentially the relative amount by which the weights at layer L affect the 
total cost, and we use this to update the weights at this layer. Our work isn’t complete, 
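however; now we need to continue down the rest of the layers. For layer $L-1$:

$$\delta_{L-1} = \frac{\partial C}{\partial a^{L-1}} = \frac{\partial C}{\partial a^L} \cdot \frac{\partial a^L}{\partial z^L} \cdot \frac{\partial z^L}{\partial a^{L-1}} \tag{B.7}$$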

Again, $\partial C / \partial a^L = \delta_L$ (Equation B.4). In this way, the total error is being incorporated
down the line, or backpropagated. The remaining terms have known derivatives, so the equation
becomes:

$$\delta_{L-1} = \frac{\partial C}{\partial a^{L-1}} = \delta_L \cdot a^L (1 - a^L) \cdot w^L \tag{B.8}$$

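Now we need to find the gradient of the cost w.r.t. the weights at this layer $L-1$,
as before:

$$\frac{\partial C}{\partial w^{L-1}} = \frac{\partial C}{\partial a^{L-1}} \cdot \frac{\partial a^{L-1}}{\partial z^{L-1}} \cdot \frac{\partial z^{L-1}}{\partial w^{L-1}} \tag{B.9}$$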


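Once again, substituting $\delta_{L-1}$ for $\partial C / \partial a^{L-1}$ (Equation B.8) and taking the derivatives of
the other terms, we get:

$$\frac{\partial C}{\partial w^{L-1}} = \delta_{L-1} \cdot a^{L-1} (1 - a^{L-1}) \cdot a^{L-2} \tag{B.10}$$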

This process is repeated layer by layer all the way down to the first layer. 

To recap, we first find $\delta_L$ (Equation B.4), which is the error of the cost function
(Equation B.3), and we use that value in the equation for the derivative of the cost
function w.r.t. the weights in layer $L$ (Equation B.6). In the next layer, we find $\delta_{L-1}$
(Equation B.8), the gradient of the cost w.r.t. the output of layer $L-1$. As before, this
is used in the equation to calculate the gradient of the cost function w.r.t. the weights in
layer $L-1$ (Equation B.10). And so on; backpropagation continues until we reach the
model inputs.
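The recap can be checked numerically. The following is a minimal sketch, assuming a sigmoid activation, a chain of just two single-neuron layers (so that the activation feeding layer L − 1 is the network input x), and arbitrary parameter values; the function names are illustrative only. It computes the two weight gradients from the δ values and compares them with finite-difference estimates.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, b1, w2, b2):
    """Forward pass through two single-neuron layers: L - 1, then L."""
    a1 = sigmoid(w1 * x + b1)     # a^{L-1}
    a2 = sigmoid(w2 * a1 + b2)    # a^L (Equations B.1 and B.2)
    return a1, a2


def cost(x, y, w1, b1, w2, b2):
    """Quadratic cost on the final activation (Equation B.3)."""
    _, a2 = forward(x, w1, b1, w2, b2)
    return (a2 - y) ** 2


def backprop(x, y, w1, b1, w2, b2):
    """Return (dC/dw^{L-1}, dC/dw^L) using the delta values."""
    a1, a2 = forward(x, w1, b1, w2, b2)
    delta_L = 2 * (a2 - y)                      # Equation B.4
    dC_dw2 = delta_L * a2 * (1 - a2) * a1       # Equation B.6
    delta_Lm1 = delta_L * a2 * (1 - a2) * w2    # Equation B.8
    dC_dw1 = delta_Lm1 * a1 * (1 - a1) * x      # Equation B.10, with a^{L-2} = x
    return dC_dw1, dC_dw2


def numerical_gradient(name, x, y, params, eps=1e-6):
    """Central finite difference of the cost w.r.t. one named parameter."""
    up = dict(params, **{name: params[name] + eps})
    down = dict(params, **{name: params[name] - eps})
    return (cost(x, y, **up) - cost(x, y, **down)) / (2 * eps)


# Arbitrary illustrative values
x, y = 0.7, 1.0
params = {"w1": 0.3, "b1": -0.1, "w2": -0.6, "b2": 0.2}

dC_dw1, dC_dw2 = backprop(x, y, **params)
num_dw1 = numerical_gradient("w1", x, y, params)
num_dw2 = numerical_gradient("w2", x, y, params)
print(dC_dw1, num_dw1)
print(dC_dw2, num_dw2)
```

Each printed pair should agree to several decimal places, which is a quick way to catch a misplaced superscript when working through the derivation by hand.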

Up to this point in this appendix, we’ve only dealt with networks with single inputs,
single hidden neurons, and single outputs. In practice, deep learning models are never this
simple. Thankfully, the math shown above scales straightforwardly given multiple neurons
in a layer and multiple network inputs and outputs.

Consider the case where there are multiple output classes, such as when you’re classifying
MNIST digits. In this case, there are 10 output classes ($n = 10$) representing the digits
0-9. For each class, the model provides a probability that a given input image belongs to
that class. To find the total cost, we find the sum of the (quadratic, in this case) cost over
all the classes:


$$C_0 = \sum_{n=1}^{n} (a_n^L - y_n)^2 \tag{B.11}$$

In Equation B.11, $a^L$ and $y$ are vectors, each containing $n$ elements.

Examining $\partial C / \partial w^L$ for this, the output layer, we must account for the fact that there
may be many neurons in the final hidden layer, each one of them connected to each
output neuron. It’s helpful here to switch the notation slightly: Let the final hidden layer
be $i$ and the output layer be $j$. In this way, we have a matrix of weights that can be
accessed with a row for each output neuron and a column for each hidden-layer neuron,
and each weight can be denoted $w_{ji}$. So now, we find the gradient on each weight
(remember, there are $i \times j$ weights: one for each connection between each neuron in
the two layers):


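$$\frac{\partial C}{\partial w_{ji}^L} = \frac{\partial C}{\partial a_j^L} \cdot \frac{\partial a_j^L}{\partial z_j^L} \cdot \frac{\partial z_j^L}{\partial w_{ji}^L} \tag{B.12}$$

We do this for every single weight in the layer, creating the gradient vector for the
weights of size $i \times j$.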

Although this is essentially the same as our single-neuron-per-layer backprop (refer to
Equation B.7), the equation for the gradient of the cost w.r.t. the preceding layer’s output
$a^{L-1}$ will change (i.e., the $\delta_{L-1}$ value). Because this gradient is composed of the partial
derivatives of the current layer’s inputs and weights, and because there are now multiple of
those, we need to sum everything up. Sticking with the $i$ and $j$ notation:


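$$\delta_{L-1} = \frac{\partial C}{\partial a_i^{L-1}} = \sum_{j} \frac{\partial C}{\partial a_j^L} \cdot \frac{\partial a_j^L}{\partial z_j^L} \cdot \frac{\partial z_j^L}{\partial a_i^{L-1}} \tag{B.13}$$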


This is a lot of math to take in, so let’s review in simple terms: Relative to the simpler
network of Equations B.1 through B.10, the equations haven’t changed except that instead
of calculating the gradient on a single weight, we need to calculate the gradient on
multiple weights (Equation B.12). In order to calculate the gradient on any given weight,
we need that $\delta$ value, which itself is composed of the error over a number of connections
in the preceding layer, so we calculate the sum over all these errors (Equation
B.13).
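As a sketch of how the per-weight gradient (Equation B.12) and the summed δ value (Equation B.13) fit together, the plain-Python example below uses three hidden-layer activations feeding two output neurons. The sigmoid activation, the layer sizes, and all numeric values are assumptions chosen for illustration; the weight list `W[j][i]` follows the row-per-output-neuron layout described above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output_activations(a_prev, W, b):
    """a_j = sigmoid(sum_i w_ji * a_i + b_j) for every output neuron j."""
    return [sigmoid(sum(w_ji * a_i for w_ji, a_i in zip(row, a_prev)) + b_j)
            for row, b_j in zip(W, b)]


def cost(a_prev, W, b, y):
    """Quadratic cost summed over all output neurons (Equation B.11)."""
    a = output_activations(a_prev, W, b)
    return sum((a_j - y_j) ** 2 for a_j, y_j in zip(a, y))


def gradients(a_prev, W, b, y):
    a = output_activations(a_prev, W, b)
    # Shared factor for output neuron j: dC/da_j * da_j/dz_j
    err = [2 * (a_j - y_j) * a_j * (1 - a_j) for a_j, y_j in zip(a, y)]
    # Equation B.12: one gradient per weight, since dz_j/dw_ji = a_i
    dC_dW = [[e_j * a_i for a_i in a_prev] for e_j in err]
    # Equation B.13: sum over output neurons j, since dz_j/da_i = w_ji
    delta_prev = [sum(err[j] * W[j][i] for j in range(len(W)))
                  for i in range(len(a_prev))]
    return dC_dW, delta_prev


a_prev = [0.2, 0.9, 0.5]                     # hidden-layer activations (layer i)
W = [[0.1, -0.4, 0.3], [0.7, 0.2, -0.5]]     # W[j][i], two output neurons
b = [0.05, -0.05]
y = [1.0, 0.0]

dC_dW, delta_prev = gradients(a_prev, W, b, y)

# Finite-difference checks on one weight and one hidden activation
eps = 1e-6
W_up = [row[:] for row in W]
W_down = [row[:] for row in W]
W_up[1][2] += eps
W_down[1][2] -= eps
num_dW_12 = (cost(a_prev, W_up, b, y) - cost(a_prev, W_down, b, y)) / (2 * eps)

a_up, a_down = a_prev[:], a_prev[:]
a_up[0] += eps
a_down[0] -= eps
num_delta_0 = (cost(a_up, W, b, y) - cost(a_down, W, b, y)) / (2 * eps)

print(dC_dW[1][2], num_dW_12)
print(delta_prev[0], num_delta_0)
```

The gradient on the weights has one entry per connection, whereas each entry of `delta_prev` collects contributions from every output neuron, which is the summation the prose describes.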
In this appendix, we’ll introduce the distinguishing elements of PyTorch, including
contrasting it with its primary competition, TensorFlow.

PyTorch Features 

In Chapter 14, we introduced PyTorch at a high level. In this section, we continue by 
examining the library’s core attributes.

Autograd System 

PyTorch operates using what’s called an autograd system, which relies on the principle of
reverse-mode automatic differentiation. As detailed in Chapter 7, the end product of
forward propagating through a deep neural network is the result of a series of functions
chained together. Reverse-mode automatic differentiation applies the chain rule to
differentiate the cost at the end with respect to the inputs, working backwards (introduced in
Chapter 8 and detailed in Appendix B). At each iteration, the activations of the neurons
in the network are computed by forward propagation, and each function is recorded on
a graph. At the end of the forward pass, this graph can be traversed backwards to calculate the
gradient at each neuron.
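To illustrate the idea of recording functions on a graph and then walking it backwards, here is a toy reverse-mode differentiator in plain Python. It is a teaching sketch only: the `Value` class and its attributes are hypothetical and are not how PyTorch implements autograd internally.

```python
class Value:
    """A scalar that records the operations applied to it (a toy autograd)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents            # nodes this value was computed from
        self._local_grads = local_grads    # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Topologically order the recorded graph, then apply the chain rule
        # from the output back toward the inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local


w, x, b = Value(3.0), Value(2.0), Value(1.0)
z = w * x + b       # forward pass: the graph is recorded as this line runs
cost = z * z        # a toy "cost" chosen purely for illustration
cost.backward()     # reverse pass through the recorded graph
print(w.grad, x.grad, b.grad)   # 28.0 42.0 14.0
```

With z = 7, the chain rule gives 2z · x = 28 for w, 2z · w = 42 for x, and 2z = 14 for b, matching the printed values.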

Define-by-Run Framework 

What makes autograd especially interesting is the define-by-run nature of the framework:
The calculations for backpropagation are defined with each forward pass. This is important
because it means that the backpropagation step is only dependent on how your code
is run, and as such the backpropagation mathematics can vary with each forward pass.
This means that every round of training (see Figure 8.5) can be different. This is useful in
settings such as natural language processing, where the input sequence length is typically
set to the maximum length (i.e., the longest sentence in the corpus) and shorter sequences
are padded with zeros (as we did in Chapter 11). PyTorch, in contrast, natively supports
dynamic inputs, circumventing the need for this truncating and padding.
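A minimal sketch of what define-by-run permits: the hypothetical `score` function below loops over however many tokens a sequence contains, so the recorded graph, and therefore the backward computation, differs between a two-token and a five-token input. The weight vector `w` and the all-ones "tokens" are placeholders invented for this example.

```python
import torch

torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)   # hypothetical shared weight vector


def score(sequence):
    """Sum the dot product of w with each token, whatever the sequence length."""
    total = torch.zeros(())
    for token in sequence:               # ordinary Python loop; length varies
        total = total + torch.dot(w, token)
    return total


short_seq = torch.ones(2, 4)   # a sequence of 2 tokens
long_seq = torch.ones(5, 4)    # a sequence of 5 tokens

score(short_seq).backward()
grad_short = w.grad.clone()    # gradient is the sum of the 2 tokens

w.grad.zero_()                 # clear the buffer before the next pass
score(long_seq).backward()
grad_long = w.grad.clone()     # gradient is the sum of the 5 tokens

print(grad_short, grad_long)
```

No padding or truncation was needed: each call built a graph sized to its own input.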

The define-by-run framework also means that the framework is not asynchronous.
When a line is executed, the code is run, making debugging much simpler. When the
code throws an error, you’re able to see exactly which line caused the error. Furthermore,
by running an appropriate helper function, this so-called eager execution can be easily
replaced with a traditional graph-based model, wherein the graphs are defined in
advance, which brings with it speed and optimization benefits.
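The text does not name the helper function. As one possibility, and this is our assumption rather than the authors' stated choice, `torch.jit.trace` records the operations executed on an example input and returns a graph-based version of the function. The function `f` below is a made-up example.

```python
import torch


def f(x):
    """An arbitrary eager-mode function used only for illustration."""
    return torch.sigmoid(x) * 2 + 1


example = torch.randn(3)

# Trace the function once on an example input to obtain a static graph.
traced = torch.jit.trace(f, example)

print(traced.graph)                          # the recorded operations
same = torch.allclose(traced(example), f(example))
print(same)
```

Tracing records only the operations that ran for the example input, so functions whose control flow depends on the data may need a different approach.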

PyTorch Versus TensorFlow 

You might now wonder when one might select PyTorch over TensorFlow. The answer
is not unambiguous, but we’ll explore some of the advantages and disadvantages of each 
library here. 

One relevant topic is adoption: TensorFlow is currently more widely used than 
PyTorch. PyTorch was first released to the public in January 2017, whereas TensorFlow 
was released a little over a year prior, in November 2015. In the rapidly developing world 
of deep learning, this is a significant head start. Indeed, the 1.0.0 version of PyTorch was 
only released on December 7, 2018. In this way, TensorFlow gained traction and a large 
body of tutorials and Stack Overflow posts emerged online, giving Google’s library an
edge. 

A second consideration is that PyTorch’s dynamic interface makes iteration easier
and quicker relative to the static nature of TensorFlow. 1 With PyTorch you can
define, change, and execute nodes as you go, as opposed to defining the entire model in
advance. Debugging is significantly easier in PyTorch, largely because graphs are defined
at run time. This means that errors occur when the code is executed and are more easily
traceable to the offending line of code.

Visualization in TensorFlow is intuitive and easy with the built-in TensorBoard platform
(see Figure 9.8). However, TensorBoard integrations with PyTorch do exist, and
data are more implicitly available during PyTorch model training, so custom solutions can
be built using other libraries (for example, with matplotlib).

TensorFlow is used in both development and production at Google, and for this reason 
the library has much more sophisticated deployment options, including mobile support 
and distributed training support. PyTorch has historically lagged in these departments; 
however, with the release of PyTorch 1.0.0, a new just in time (JIT) compiler and its 
new distributed library are available to address these shortcomings. Additionally, all of 
the major cloud providers have announced PyTorch integrations, including ones with 
TensorBoard and TPU support on Google Cloud! 2 

When it comes to everyday use, PyTorch feels more “Pythonic” than TensorFlow: It
was written specifically as a Python library, and so it will feel familiar to Python
developers. While TensorFlow has an established Python implementation that’s widely used,
the library was originally written in C++, and so its Python implementation can feel
cumbersome. Of course, Keras exists to try to solve this problem, but in the process it
obscures some of TensorFlow’s functionality. 3 On the topic of Keras, PyTorch has the


1. The eager mode central to TensorFlow 2.0 intends to remedy this.

2. One might have expected Google to drag its feet on integrations with the library of one of its primary 
competitors—in this case, Facebook. 

3. TensorFlow 2.0’s tight coupling with Keras intends to correct many of these issues.
Fast.ai library, 4 which aims to provide high-level abstractions to PyTorch that are
analogous to those provided by Keras to TensorFlow.

Taking all of these topics into account, if you’re doing research or if your in-production
execution demands are not very high, PyTorch might be the optimal choice.
The speed of iteration when experimenting, coupled with simpler debugging and
extensive NumPy integration, makes this library well suited to research. However, if
you’re deploying deep learning models into a production environment, you’ll find more
support with TensorFlow. This is especially the case if you’re using distributed training or
performing inference on a mobile platform.

PyTorch in Practice 

In this section, we go over the basics of PyTorch installation and use.

PyTorch Installation 

Alongside TensorFlow and Keras, PyTorch is one of the libraries in the Docker container
we recommended installing 5 for running the Jupyter notebooks throughout this book. So,
if you followed those instructions, you’re already all set. If you’re working outside of our
recommended Docker setup, then you can consult the installation notes that are available
on the PyTorch homepage. 6

The Fundamental Units Within PyTorch 

The fundamental units within PyTorch are tensors and variables, which we describe in 
turn here. 

Basic Operations with Tensors 

As in TensorFlow, tensor is little more than a fancy name for a matrix or vector. Tensors 
are functionally the same as NumPy arrays, except that PyTorch provides specific methods 
to perform computation with them on GPUs. Under the hood, these tensors also keep a 
record of the graph (for the autograd system) and the gradients. 

The default tensor is usually FloatTensor. PyTorch has eight types of tensors, which
contain either integers or floats. When you define which type of tensor you’d like to use,
that choice has memory and precision implications; 8-bit integers can only store 256 values
(i.e., [0 : 255]) and occupy much less memory than 64-bit 7 integers. In
cases where, say, integers up to 255 are all that is required, using higher-order integers
would be unnecessary. This consideration is especially relevant when you’re running
models on GPU architectures, because memory is generally the limiting factor on GPUs,
as compared to running models on the CPU, where installing more RAM is relatively
cheap.
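The memory difference can be inspected directly. This short sketch, with a hypothetical helper `n_bytes`, compares the storage needed for the same 28×28×1 shape at two integer widths.

```python
import torch

small = torch.zeros(28, 28, 1, dtype=torch.uint8)   # 8-bit unsigned integers
large = torch.zeros(28, 28, 1, dtype=torch.int64)   # 64-bit signed integers


def n_bytes(t):
    """Bytes occupied by the tensor's elements: count times bytes per element."""
    return t.numel() * t.element_size()


print(n_bytes(small), n_bytes(large))   # 784 6272
```

The 64-bit tensor takes eight times the memory for the same number of elements, which adds up quickly when whole datasets or large batches sit on a GPU.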


4. github.com/fastai/fastai 

5. See the beginning of Chapter 5 for these instructions. 

6. pytorch.org 

7. 64-bit integers can store values as large as $2^{63} - 1$, which is about 9.2 quintillion.
import torch

x = torch.zeros(28, 28, 1, dtype=torch.uint8)
y = torch.randn(28, 28, 1, dtype=torch.float32)

This code (which is available in our PyTorch Jupyter notebook, along with all of the other
examples in this appendix) creates a 28×28×1 tensor, x, that’s filled with zeros, of the
type uint8. 8 You could also have used torch.ones() to create a comparable tensor filled
with ones. The second tensor, y, contains random numbers from the standard normal
distribution. 9 By definition, these cannot be 8-bit integers, so we specified 32-bit floats
here.

As mentioned initially, these tensors have a lot in common with NumPy n-dimensional
arrays. For example, it’s easy to generate a PyTorch tensor from a NumPy
array with the torch.from_numpy() method. The PyTorch library also contains many
math operations that can be efficiently performed on these tensors, many of which mirror
their NumPy counterparts.
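A brief sketch of this interoperability follows; the array contents are arbitrary. Note that a tensor created with `torch.from_numpy()` shares memory with the source array, so changing one changes the other.

```python
import numpy as np
import torch

array = np.arange(6, dtype=np.float32).reshape(2, 3)
tensor = torch.from_numpy(array)     # shares memory with `array`

array[0, 0] = 100.0                  # the change is visible through the tensor

doubled = torch.mul(tensor, 2)       # mirrors np.multiply(array, 2)
col_means = tensor.mean(dim=0)       # mirrors array.mean(axis=0)
back = doubled.numpy()               # convert a tensor back to a NumPy array

print(tensor)
print(back)
print(col_means)
```

The `dim` argument in PyTorch plays the role of `axis` in NumPy, one of the small naming differences to watch for when porting code.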

Automatic Differentiation 

PyTorch tensors can natively store the computational graph for the network as well as the
gradients. This is enabled by setting the requires_grad argument to True when you create
the tensor. Now, each tensor has a grad attribute that stores the gradient. Initially, this
is set to None until the tensor’s backward() method is called. The backward() method
reverses through the record of operations and calculates the gradient at each point in the
graph. After the first call to backward(), the grad attribute becomes filled with gradient
values.

In the following code block, we define a simple tensor, perform some mathematical
operations, and call the backward() method to reverse through the graph and calculate
the gradients. Subsequently, the grad attribute will store gradients.

import torch

x = torch.zeros(3, 3, dtype=torch.float32, requires_grad=True)

y = x - 4
z = y**3 * 6
out = z.mean()

out.backward()

print(x.grad)


8. The “u” in uint8 stands for unsigned, meaning that these 8-bit integers span from 0 to 255 instead of from -128
to 127.

9. The standard normal distribution has a mean of 0 and a standard deviation of 1.
Because x had its requires_grad flag set, we can perform backpropagation on this series
of computations. PyTorch has accumulated the functions that generated the final output
using its autograd system, so calling out.backward() will calculate the gradients and store
them in x.grad. The final line prints the following:

tensor([[32., 32., 32.],
        [32., 32., 32.],
        [32., 32., 32.]])
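The value 32 can be confirmed by hand. Writing the computation as a function of each element $x_{ij}$ of the 3×3 tensor, and recalling that the mean divides by the nine elements, the derivative works out as follows (a short worked check, not part of the original notebook):

```latex
\begin{aligned}
\text{out} &= \frac{1}{9}\sum_{i=1}^{3}\sum_{j=1}^{3} 6\,(x_{ij} - 4)^3 \\
\frac{\partial\,\text{out}}{\partial x_{ij}} &= \frac{1}{9}\cdot 18\,(x_{ij} - 4)^2 = 2\,(x_{ij} - 4)^2 \\
\left.\frac{\partial\,\text{out}}{\partial x_{ij}}\right|_{x_{ij}=0} &= 2\cdot(-4)^2 = 32
\end{aligned}
```

Every element of x is zero, so every entry of the gradient is 32.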

As this example demonstrates, PyTorch takes the hassle out of automatic differentiation. 
Next, we cover the basics of building a neural network in PyTorch.

Building a Deep Neural Network in PyTorch 

The essential paradigm of building neural networks should be familiar: They consist of 
multiple layers that are stacked together (as in Figure 4.2). In the examples throughout 
this book, we used the Keras library as a high-level abstraction over the raw TensorFlow 
functions. Similarly, the PyTorch nn module contains layerlike modules that receive 
tensors as inputs and return tensors as outputs. In the following example, we build a 
two-layer network akin to the dense nets we used to classify handwritten digits in Part II: 

import torch

# Define random tensors for the inputs and outputs
x = torch.randn(32, 784, requires_grad=True)
y = torch.randint(low=0, high=10, size=(32,))

# Define the model, using the Sequential class
model = torch.nn.Sequential(
    torch.nn.Linear(784, 100),
    torch.nn.Sigmoid(),
    torch.nn.Linear(100, 10),
    torch.nn.LogSoftmax(dim=1)
)

# Define the optimizer and loss function
optimizer = torch.optim.Adam(model.parameters())
loss_fn = torch.nn.NLLLoss()

for step in range(1000):
    # Make predictions by forward propagation
    y_hat = model(x)

    # Calculate the loss
    loss = loss_fn(y_hat, y)

    # Zero-out the gradient before performing a backward pass
    optimizer.zero_grad()

    # Compute the gradients w.r.t. the loss
    loss.backward()

    # Print the results
    print('Step: {:4d} - loss: {:0.4f}'.format(step+1, loss.item()))

    # Update the model parameters
    optimizer.step()


Let’s break this down step by step: 

■ The x and y tensors are placeholders for the input and output values of the model.

■ We use the Sequential class to begin building our model as a series of layers
(Linear() through to LogSoftmax()), in much the same way as we did in Keras.

■ We initialize an optimizer; in this case we use Adam with its default values. We
also pass into the optimizer all of the tensors we’d like optimized, in this case
model.parameters().

■ We also initialize the loss function, although it doesn’t require any parameters. We
opted for the built-in negative log-likelihood loss function, torch.nn.NLLLoss(). 10

■ We manually iterate over the number of rounds of training (Figure 8.5) that we’d
like to take (in this case, 1000), and during each round we

    ■ Calculate the model outputs using y_hat = model(x).

    ■ Calculate the loss using the function we defined earlier, passing in the
    predicted y values and the true y values.

    ■ Zero the gradients. This is necessary because the gradients are accumulated in
    buffers, and not overwritten.

    ■ Perform backpropagation to recalculate the gradients, given the loss.

    ■ Finally, take a step using the optimizer. This updates the model weights using
    the gradients.
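The need to zero the gradients is easy to see in isolation. The sketch below uses a made-up two-element tensor `w` and calls backward() repeatedly; without a reset, each call adds to the existing buffer. Here the buffer is cleared by hand with `w.grad.zero_()`, which stands in for the role that `optimizer.zero_grad()` plays in the training loop.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()            # gradient of sum(w^2) is 2w

(w ** 2).sum().backward()         # no zeroing: the new gradient is added
accumulated = w.grad.clone()      # twice the true gradient

w.grad.zero_()                    # reset the buffer by hand
(w ** 2).sum().backward()
after_zeroing = w.grad.clone()    # the true gradient again

print(first, accumulated, after_zeroing)
```

Skipping the reset would therefore make each optimizer step use a running sum of all past gradients rather than the gradient for the current round.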

This procedure diverges from the model.fit() method we employed in Keras. However,
with all of the theory covered in this book and the hands-on examples we’ve
worked through together, hopefully it’s not a stretch to appreciate what’s taking place
in this PyTorch code. Without too much effort, you should be able to adapt the deep
learning models in this book from Keras into PyTorch. 11


10. Pairing a LogSoftmax() output layer with the torch.nn.NLLLoss() cost function in PyTorch is equivalent
to using a softmax output layer with cross-entropy cost in Keras. PyTorch does have a cross_entropy() cost
function, but it incorporates the softmax calculation so that if you were to use it, you wouldn’t need to apply the
softmax activation function to your model output.
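The equivalence described in this note can be checked numerically. In this sketch the random `logits` and `targets` are placeholders, and `torch.nn.CrossEntropyLoss` is used as the class form of the cross-entropy cost; the two routes should give the same loss value.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(32, 10)                          # raw, unnormalized scores
targets = torch.randint(low=0, high=10, size=(32,))   # random class labels

# Route 1: explicit LogSoftmax followed by negative log-likelihood
log_probs = torch.nn.LogSoftmax(dim=1)(logits)
nll = torch.nn.NLLLoss()(log_probs, targets)

# Route 2: cross-entropy applied directly to the raw scores
ce = torch.nn.CrossEntropyLoss()(logits, targets)

print(nll.item(), ce.item())
```

If the cross-entropy route is chosen, the model's final LogSoftmax layer should be dropped so that the softmax is not applied twice.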

11. Note that our example PyTorch neural network in this appendix isn’t learning anything meaningful. The loss
decreases, but the model is simply memorizing (overfitting to) the training data we randomly generated. We’re
feeding in random numbers as inputs and mapping them to other random numbers. If we randomly generated
validation data, too, the validation loss wouldn’t decrease. If you’re feeling adventurous, you could initialize x and
y with actual data from, say, the MNIST dataset (you can import these data with Keras, as in Example 5.2) and
train a PyTorch model to map a meaningful relationship!